[PATCH v10 00/28] Issue 80: Unicode support on Windows
From: Konstantin Khomoutov <flatw...@users.sourceforge.net>
Date: Wed, 7 Dec 2011 16:00:22 +0400
Local: Wed, Dec 7 2011 7:00 am
Subject: Re: [msysGit] [PATCH v10 00/28] Issue 80: Unicode support on Windows
On Wed, 07 Dec 2011 11:42:43 +0900

Atsushi Nakagawa <at...@chejz.com> wrote:
> > I think that currently the only way to reliably get UTF-8 in commit
> > messages when using Vim is to set core.editor (or the GIT_EDITOR
> > environment variable) to
> > vim.exe -c "set enc=utf-8"
> > or
> > vim.exe -c "set fenc=utf-8"
> > (I think the former is slightly more logical these days).

> I think the former would also require a "set tenc=ANSI" (where ANSI is
> the specific ANSI codepage (ACP), as per Microsoft's definition, of
> the system).

I think this is incorrect.

Note that one of the infamous properties of the Windows console and how
"console" programs are supposed to interact with it is that (at least by
default) the Windows console uses two different encodings: one for the
output and another one for everything else--command-line arguments and
file I/O.  That is, any output function from stdio.h used in your
typical console program is supposed to output text encoded in "OEM"
encoding when it writes to the console stream, and use "ANSI" encoding
when it writes files and interprets command-line arguments.

You can try creating a text file and printing its contents in the
console window by running `type FILENAME`.  You'll notice that the
output will be readable only if the text is composed in the relevant
"OEM" encoding *unless* the console code page has been changed (using
the chcp command for instance).
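
The mismatch described above can be reproduced without a console. The following is a minimal sketch in Python, assuming a Russian system whose "OEM" code page is cp866 and whose "ANSI" code page is cp1251 (the cp1251 value is an assumption, it is not stated in this thread):

```python
# Sketch: the same text has different bytes in the "ANSI" and "OEM" code pages.
# Assumption: Russian Windows with ANSI = cp1251 and OEM = cp866.
text = "Привет"

ansi_bytes = text.encode("cp1251")  # what a typical Windows text editor saves
oem_bytes = text.encode("cp866")    # what the console expects by default

# `type FILENAME` effectively interprets the file's bytes in the console code page.
shown_for_ansi_file = ansi_bytes.decode("cp866")
shown_for_oem_file = oem_bytes.decode("cp866")

print(shown_for_ansi_file)  # garbled
print(shown_for_oem_file)   # readable
```

Only the file saved in the "OEM" encoding comes out readable, which is the behaviour `type FILENAME` shows until the console code page is changed.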

As I understand, Git for Windows uses WriteConsoleW() to bypass normal
stdio mechanics.  This is done to output Unicode and let the console
subsystem take care of the rest.  But we do not control Vim, and for it
to use the "OEM" tenc appears to be a sensible idea.

As to what Vim does, if I run
vim -c "set enc=utf-8"
for a stock Vim 7.3 in a cmd.exe window on my Russian Windows XP,
in response to
:set enc? tenc? fenc?
I get
  encoding=utf-8
  termencoding=cp866
  fileencoding=
Which lists the expected terminal encoding (the default "OEM"
code page for my flavor of Windows).
For the Vim packaged with Git (when I run it from the Git's bash)
the result is the same.

I think we should just not touch tenc at all as Vim seems to do the
right thing itself.  See above.

> Another consideration is that the former doesn't work properly here (I
> can't input native characters with that setting because vim goes out
> of INSERT mode when I try).  A short discussion at [1] says that vim
> is flaky in unicode mode so maybe it's related to that.

[...]

Works here for Cyrillic.  From your name and Windows codepage I assume
you're inputting Japanese.  I heard the input methods (as well as
codepages) used for it are way more complicated than for roman/cyrillic
alphabets.  Maybe this is the issue here.

> Anyways, I propose the latter approach and will put together a patch
> as soon as I've overcome some technical and time constraint hurdles.
> Maybe I'll post an RFC of what I'm thinking of doing before that
> because I'm on a network where I can't check out external
> repositories.

> > Messing with defaults is dangerous because on Windows you usually
> > expect text editors to write out files in ANSI code page.

> It would be best if it could be limited to COMMIT_EDITMSG, but since
> that's a technical challenge in itself, I'm thinking it might be okay
> if the defaults were changed via
> "C:\Program Files\Git\share\vim\vimrc" because that's under the
> "Git for Windows" install path.

I wonder how this will play for the user with specific requirements like
using certain i18n.commitencoding.  I don't know if there are any,
though.

 
From: Johannes Schindelin <Johannes.Schinde...@gmx.de>
Date: Wed, 7 Dec 2011 17:21:22 +0100 (CET)
Local: Wed, Dec 7 2011 11:21 am
Subject: Re: [msysGit] [PATCH v10 00/28] Issue 80: Unicode support on Windows
Hi,

On Wed, 7 Dec 2011, Atsushi Nakagawa wrote:
> [... a long mail discussing vim encodings and then sent a patch to the
> issue tracker that did not apply...]

I tried to clean up the patch and pushed the result as an/vim-utf-8. Since
that discussion was a bit too long for my attention span I am unsure
whether this should be merged as-is.

Ciao,
Dscho


 
From: Atsushi Nakagawa <at...@chejz.com>
Date: Thu, 08 Dec 2011 09:38:04 +0900
Local: Wed, Dec 7 2011 7:38 pm
Subject: Re: [msysGit] [PATCH v10 00/28] Issue 80: Unicode support on Windows
On Wed, 7 Dec 2011 17:21:22 +0100 (CET)

Johannes Schindelin <Johannes.Schinde...@gmx.de> wrote:
> On Wed, 7 Dec 2011, Atsushi Nakagawa wrote:
> > [... a long mail discussing vim encodings and then sent a patch to the
> > issue tracker that did not apply...]

> I tried to clean up the patch and pushed the result as an/vim-utf-8.

Thanks Johannes!  I was pondering whether I should try to make that Google
Code editor patch git-compatible by hand or just wait until the
weekend when I'm at home.  I've looked over your clean-up changes
(wrapping and the first word).

Regards,

--
Atsushi Nakagawa
<at...@chejz.com>
Changes are made when there is inconvenience.


 
From: Atsushi Nakagawa <at...@chejz.com>
Date: Thu, 08 Dec 2011 11:28:05 +0900
Local: Wed, Dec 7 2011 9:28 pm
Subject: Re: [msysGit] [PATCH v10 00/28] Issue 80: Unicode support on Windows
On Wed, 7 Dec 2011 16:00:22 +0400

I've summarized the details of this reasoning to make sure I've
understood correctly:

> [ The code page for `tenc' should be "OEM" and not "ANSI".
>   (Background information and examples followed.) ]

Thanks for that!  I didn't realize there was a distinction between the
two and used "ANSI" to mean both.  (A quick look at Wikipedia's "Code
page" article suggests that (only?) Japanese, Chinese and Korean systems
share the same code page for both, so the distinction seems to hold
elsewhere.)

Nevertheless, I only used the word "ANSI" because that's what I assumed
it was when I saw vim set `enc' to "cp932".  My point about requiring
`tenc' to be set will follow in response to below...

> [ `tenc' does not need to be set because vim already does the right
>   thing and sets it to the OEM codepage like so: ]

> ..., if I run
> vim -c "set enc=utf-8"
> for a stock Vim 7.3 in a cmd.exe window on my Russian Windows XP,
> in response to
> :set enc? tenc? fenc?
> I get
>   encoding=utf-8
>   termencoding=cp866
>   fileencoding=

This is interesting indeed.  Because I get in response to the same
sequence:
  encoding=utf-8
  termencoding=
  fileencoding=

and if run without the `-c' argument:
  encoding=cp932
  termencoding=
  fileencoding=

So yes, vim does the right thing in controlling `enc' and `tenc' so that
(if we don't change anything) `tenc' and `fenc' will effectively and
correctly be OEM and ANSI respectively.

However, the moment we change `enc', we get a case of varying mileage,
and on a Japanese system `tenc' is required to be explicitly "set back"
to OEM (cp932).

I think this is all the more reason to choose your latter approach of
only changing `fenc'.  In the patch that Johannes kindly moved to
"an/vim-utf-8", this is attempted by setting just `fencs'.

> I think we should just not touch tenc at all as Vim seems to do the
> right thing itself.  See above.

I agree.  The question is then whether we avoid touching `enc' because
vim will set `tenc' to nothing on certain systems making it depend on
`enc'.

> > Another consideration is that the former doesn't work properly here (I
> > can't input native characters with that setting because vim goes out
> > of INSERT mode when I try).  A short discussion at [1] says that vim
> > is flaky in unicode mode so maybe it's related to that.
> [...]

> Works here for Cyrillic.  From your name and Windows codepage I assume
> you're inputting Japanese.  I heard the input methods (as well as
> codepages) used for it are way more complicated than for roman/cyrillic
> alphabets.  Maybe this is the issue here.

The use of IME could be related, but the same scenario works if `enc'
remains "cp932".  I dunno, maybe something was overlooked in vim's input
logic.

> > Anyways, I propose the latter approach and will put together a patch
> > as soon as [...]

> > > Messing with defaults is dangerous because on Windows you usually
> > > expect text editors to write out files in ANSI code page.

> > It would be best if it could be limited to COMMIT_EDITMSG, but since
> > that's a technical challenge in itself, I'm thinking it might be okay
> > if the defaults were changed via
> > "C:\Program Files\Git\share\vim\vimrc" because that's under the
> > "Git for Windows" install path.

On a side note, I did end up limiting it to COMMIT_EDITMSG..

> I wonder how this will play for the user with specific requirements like
> using certain i18n.commitencoding.  I don't know if there are any,
> though.

I did some quick tests...
$ git init
$ git config i18n.commitencoding cp932
$ git commit --allow-empty
  (input a japanese message with fenc=utf-8)
$ git log
  (I see the message correctly)
$ git commit --amend --allow-empty
  (I see the message correctly and fenc is utf-8)

$ git config i18n.commitencoding ascii
$ git commit --allow-empty
  (input a japanese message with fenc=utf-8)
$ git log
  (I see the message correctly)
$ git commit --amend --allow-empty
  (I see the message correctly and fenc is utf-8)

At this point, I don't know what's going on.

Regardless, I think there're three possibilities:
1, COMMIT_EDITMSG, the interface between the commit log message and
EDITOR, is utf-8 regardless of i18n.commitencoding.  In which case
everything's good.
2, COMMIT_EDITMSG's encoding mirrors that of the commit log message and
hence changes with i18n.commitencoding.  Bad, but should we cater for
this and, on account of it, leave vim editing of native characters
broken for everyone else?
3, i18n.commitencoding doesn't work at all..  In which case the patch
itself is also ok.

Regards,

--
Atsushi Nakagawa
<at...@chejz.com>
Changes are made when there is inconvenience.


 
From: Johannes Sixt <j...@kdbg.org>
Date: Thu, 08 Dec 2011 09:25:36 +0100
Local: Thurs, Dec 8 2011 3:25 am
Subject: Re: [msysGit] [PATCH v10 00/28] Issue 80: Unicode support on Windows
On 08.12.2011 03:28, Atsushi Nakagawa wrote:

I don't think that these tests are representative. git-commit copies the
bytes between COMMIT_EDITMSG and the commit object without changing
them. Hence, whatever bytes vim wrote for the first commit, it will see
the same bytes for the second commit.

The purpose of i18n.commitencoding is to tell interested readers how to
interpret the byte sequence in the commit object, and not to ask git to
convert the bytes in COMMIT_EDITMSG to that encoding.

You have set the file encoding in vim to UTF-8. But it is likely that
the byte sequence that results from Japanese text is outside cp932; it
definitely is outside ASCII. (But you were lucky; git-commit didn't care
- it is not one of those "interested readers".)
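
The point about "interested readers" can be illustrated with a short Python sketch (this is not Git code). The bytes written with fenc=utf-8 stay the same whatever label is attached to them; only a reader that trusts a wrong label goes astray. The sample text テスト is the one used in the test commits elsewhere in this thread.

```python
message = "テスト"
stored = message.encode("utf-8")  # the bytes Vim writes with fenc=utf-8

# A reader that knows the real encoding recovers the text.
assert stored.decode("utf-8") == message

# A reader that trusts a "cp932" label gets something else;
# errors="replace" keeps the sketch from raising on invalid sequences.
as_cp932 = stored.decode("cp932", errors="replace")

# A reader that trusts an "ascii" label cannot decode the bytes at all.
try:
    stored.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(as_cp932 == message, ascii_ok)
```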

(But perhaps you know all that already - I haven't followed this
sub-thread too closely.)

-- Hannes


 
From: Atsushi Nakagawa <at...@chejz.com>
Date: Fri, 09 Dec 2011 12:37:17 +0900
Local: Thurs, Dec 8 2011 10:37 pm
Subject: Re: [msysGit] [PATCH v10 00/28] Issue 80: Unicode support on Windows

On Thu, 08 Dec 2011 09:25:36 +0100

Johannes Sixt <j...@kdbg.org> wrote:
> On 08.12.2011 03:28, Atsushi Nakagawa wrote:
> > [ lots of commands ]
> > ..., I don't know what's going on.

> I don't think that these tests are representative. git-commit copies the
> bytes between COMMIT_EDITMSG and the commit object without changing
> them. Hence, whatever bytes vim wrote for the first commit, it will see
> the same bytes for the second commit.

> [clarifications on workings of i18n.commitencoding]

Thanks for the heads up.

Okay, I've redone the tests to get a better understanding of what's
going on and the results are interesting.  The conclusions I draw are
that 1, merging the patch probably won't have major consequences in this
regard, and 2, I should include "git-rebase-todo" in the filename
pattern as well.

Anyways, here's what I did:

I made 6 commits with the following settings:
1: i18n.commitencoding= ; fenc=utf-8
2: i18n.commitencoding=cp932 ; fenc=utf-8
3: i18n.commitencoding=ascii ; fenc=utf-8
4: i18n.commitencoding= ; fenc=cp932
5: i18n.commitencoding=cp932 ; fenc=cp932
6: i18n.commitencoding=ascii ; fenc=cp932

I set `i18n.commitencoding' with 'git config' prior to committing and I
used ":set fenc=x" before saving to set the output encoding.  I included
a bit of Japanese text in the commit message for testing.  (Btw, "cp932"
(Shift JIS + Microsoft extensions) is pretty much the de facto standard
charset for Japanese.)

I also put in an empty place-holder commit before "1" and tagged it
"base" for easier rebasing.

Finally, I tweaked `i18n.commitencoding' before each test because
surprisingly setting this after-the-fact seems to make a difference.

The results are: ("bad" means the text is garbled)

For 'git log' and 'gitk'
$ git config --unset i18n.commitencoding ; git log ; gitk
  (log -> good: 1, 2, 3, 5; bad: 4, 6)
  (gitk -> good: 1, 2, 3, 5; bad: 4, 6)

$ git config i18n.commitencoding cp932 ; git log ; gitk
  (log -> bad: 1, 2, 3, 4, 5, 6)
  (gitk -> good: 1, 2, 3, 4, 5, 6)

$ git config i18n.commitencoding ascii ; git log ; gitk
  (log -> good: 1, 2, 3, 5; bad: 4, 6)
  (gitk -> bad: 1, 2, 3, 4, 5, 6)

For 'rebase -i'
$ git config --unset i18n.commitencoding ; git rebase -i base
  (:e ++enc=utf-8 -> good: 1, 2, 3, 5; bad: 4, 6)
  (:e ++enc=cp932 -> good: 4, 6; bad: 1, 2, 3, 5)

  (Same if I set `i18n.commitencoding' to "cp932" or "ascii")

For 'git commit --amend'
$ git config --unset i18n.commitencoding
$ git checkout 1 && git commit --allow-empty --amend
  (:e ++enc=utf-8 -> good)

$ git checkout 2 && git commit --allow-empty --amend
  (:e ++enc=utf-8 -> good)

$ git checkout 3 && git commit --allow-empty --amend
  (:e ++enc=utf-8 -> good)

$ git checkout 4 && git commit --allow-empty --amend
  (:e ++enc=utf-8 -> bad)
  (:e ++enc=cp932 -> good)

$ git checkout 5 && git commit --allow-empty --amend
  (:e ++enc=utf-8 -> good)

$ git checkout 6 && git commit --allow-empty --amend
  (:e ++enc=utf-8 -> bad)
  (:e ++enc=cp932 -> good)

$ git config i18n.commitencoding cp932
$ (same sequence as above for 1 to 6)
  (1 to 6 -> "COMMIT_EDITMSG" came out "cp932")

$ git config i18n.commitencoding ascii
$ (same sequence as above for 1 to 6)
  (1, 2, 3, 5 -> "COMMIT_EDITMSG" came out "utf-8")
  (4, 6 -> "COMMIT_EDITMSG" came out "cp932")

So "COMMIT_EDITMSG" is normalized to whatever `i18n.commitencoding' is
at least on output (when amending).  Thankfully, `vim' seems to work
regardless of "set fencs=utf-8" because the "i18n.commitencoding=cp932"
test for 'commit --amend', resulted in `vim' choosing nothing for `fenc'
(defaulting to ANSI because `enc' is "cp932"). *

The result of "5" in the "ascii" version of 'git log' suggests the raw
bytes in "COMMIT_EDITMSG" may be normalized to an internal
representation of "utf-8" on commit as well..

* I thought I might have had to fix the patch to "set fencs=utf-8,ANSI",
which would have been a pain to do, but that doesn't seem to be the case.
A quick look at the ":help fencs" confirms this: if none of the `fencs'
encodings work, `vim' defaults to nothing, which is effectively ANSI in
our case.
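
The fallback described in that footnote can be sketched in Python. This is only an analogy for the behaviour documented under ":help fencs", not Vim's implementation; the function names are hypothetical, and "cp932" stands in for the ANSI code page of a Japanese system.

```python
def pick_fileencoding(raw, fileencodings):
    """Return the first candidate that decodes `raw`, or "" if none does."""
    for candidate in fileencodings:
        try:
            raw.decode(candidate)
        except UnicodeDecodeError:
            continue
        return candidate
    return ""  # empty `fenc`: the value of `enc` is used instead


def read_buffer(raw, fileencodings, encoding):
    """Decode `raw` the way the footnote describes: `fencs` first, then `enc`."""
    fenc = pick_fileencoding(raw, fileencodings)
    return fenc, raw.decode(fenc or encoding)


# With "set fencs=utf-8" and enc=cp932 (the assumed ANSI code page):
utf8_result = read_buffer("テスト".encode("utf-8"), ["utf-8"], "cp932")
cp932_result = read_buffer("テスト".encode("cp932"), ["utf-8"], "cp932")
```

In this sketch both inputs are read back as テスト: the UTF-8 file with `fenc` set to "utf-8", and the cp932 file with `fenc` left empty.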

Regards,

--
Atsushi Nakagawa
<at...@chejz.com>
Changes are made when there is inconvenience.

  fenc_utf-8_test_repos.bundle
1K Download

 
From: Johannes Sixt <j...@kdbg.org>
Date: Fri, 09 Dec 2011 08:50:28 +0100
Local: Fri, Dec 9 2011 2:50 am
Subject: Re: [msysGit] [PATCH v10 00/28] Issue 80: Unicode support on Windows
On 09.12.2011 04:37, Atsushi Nakagawa wrote:

> Okay, I've redone the tests to get a better understanding of what's
> going on and the results are interesting.  The conclusions I draw are
> that 1, merging the patch probably won't have major consequences in this
> regard, and 2, I should include "git-rebase-todo" in the filename
> pattern as well.

> Anyways, here's what I did:

> I made 6 commits with the following settings:
> 1: i18n.commitencoding= ; fenc=utf-8
> 2: i18n.commitencoding=cp932 ; fenc=utf-8
> 3: i18n.commitencoding=ascii ; fenc=utf-8
> 4: i18n.commitencoding= ; fenc=cp932
> 5: i18n.commitencoding=cp932 ; fenc=cp932
> 6: i18n.commitencoding=ascii ; fenc=cp932

None of the commits in the bundle that you included has an 'encoding'
header. No wonder that the results are surprising (at least for me):

git cat-file commit db0369c
tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
parent fc3da0a016da7dabab284b5074a140265eb7df12
author Atsushi Nakagawa <at...@chejz.com> 1323394982 +0900
committer Atsushi Nakagawa <at...@chejz.com> 1323395503 +0900

i18n.commitencoding=cp932 ; fenc=cp932 ; テスト

Why is this? I made another commit on Linux like this:

git config i18n.commitencoding cp932
git commit --allow-empty -C fc3da0

then there is a header:

tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
parent 4a88c9a4b3f7ab384589e9780641bdadb7b09eb5
author Atsushi Nakagawa <at...@chejz.com> 1323394941 +0900
committer Johannes Sixt <j...@kdbg.org> 1323416504 +0100
encoding cp932

i18n.commitencoding= ; fenc=cp932 ; e X g
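
The trailing "e X g" in that message is consistent with cp932-encoded テスト shown on a display that cannot render the non-ASCII bytes. A quick Python check, assuming the message text was テスト as in the earlier commit:

```python
raw = "テスト".encode("cp932")
print(raw)  # b'\x83e\x83X\x83g'

# Blank out every byte outside ASCII, roughly what a terminal
# that cannot show those bytes would leave behind.
visible = "".join(chr(b) if b < 0x80 else " " for b in raw)
print(visible.split())  # ['e', 'X', 'g']
```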

> Finally, I tweaked `i18n.commitencoding' before each test because
> surprisingly setting this after-the-fact seems to make a difference.

Yes, it looks like, if i18n.logoutputencoding is not set, then
i18n.commitencoding is used. Therefore, for the log tests, you should
either set the former, or unset the latter.
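
A minimal Python sketch of that fallback. The function name is hypothetical, the config is modelled as a plain dictionary, and the final UTF-8 default follows Git's documentation for i18n.logOutputEncoding rather than anything shown in this thread:

```python
def effective_log_encoding(config):
    """Return the encoding 'git log' output would aim for."""
    return (
        config.get("i18n.logoutputencoding")
        or config.get("i18n.commitencoding")
        or "utf-8"
    )


print(effective_log_encoding({"i18n.commitencoding": "cp932"}))  # cp932
```

Setting i18n.logoutputencoding explicitly for the log tests keeps the output encoding fixed while i18n.commitencoding is being varied.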

-- Hannes


 
From: Atsushi Nakagawa <at...@chejz.com>
Date: Fri, 09 Dec 2011 20:15:16 +0900
Local: Fri, Dec 9 2011 6:15 am
Subject: Re: [msysGit] [PATCH v10 00/28] Issue 80: Unicode support on Windows

On Fri, 09 Dec 2011 08:50:28 +0100

Oops, my bad.  I did a series of 'git commit --allow-empty -C's to
insert the base commit and that must've wiped the 'encoding' header then.

> git cat-file commit db0369c

Aha, so that's how you look at the header as well as the raw bytes..

The raw bytes for "5" also looked doubtful so I recreated each commit
then redid each test:

For 'git log' and 'gitk'
$ git config --unset i18n.commitencoding ; git log ; gitk
  (log -> good: 1, 2, 3, 5; bad: 4, 6)
  (gitk -> good: 1, 2, 3, 5; bad: 4, 6)

$ git config i18n.commitencoding cp932 ; git log ; gitk
  (log -> good: 2, 3; bad: 1, 4, 5, 6)
  (gitk -> good: 1, 4, 5, 6; bad: 2, 3)

$ git config i18n.commitencoding ascii ; git log ; gitk
  (log -> good: 1, 2, 3; bad: 4, 5, 6)
  (gitk -> bad: 1, 2, 3, 4, 5, 6)

For 'rebase -i'
$ git config --unset i18n.commitencoding ; git rebase -i base
  (utf-8 lines: 1, 2, 3, 5; cp932 lines: 4, 6)

$ git config i18n.commitencoding cp932 ; git rebase -i base
  (utf-8 lines: 2, 3; cp932 lines: 1, 4, 5, 6)

$ git config i18n.commitencoding utf-8 ; git rebase -i base
  (utf-8 lines: 1, 2, 3; cp932 lines: 4, 5, 6)

For 'git commit --amend'
$ git config --unset i18n.commitencoding
$ git checkout 1 && git commit --allow-empty --amend
$ git checkout 2 && git commit --allow-empty --amend
$ git checkout 3 && git commit --allow-empty --amend
$ git checkout 4 && git commit --allow-empty --amend
$ git checkout 5 && git commit --allow-empty --amend
$ git checkout 6 && git commit --allow-empty --amend
  (utf-8 files: 1, 2, 3, 5; cp932 files: 4, 6)

$ git config i18n.commitencoding cp932
$ (same sequence as above for 1 to 6)
  (utf-8 files: 2, 3; cp932 files: 1, 4, 5, 6)

$ git config i18n.commitencoding ascii
$ (same sequence as above for 1 to 6)
  (utf-8 files: 1, 2, 3; cp932 files: 4, 5, 6)

So yeah, a more sensible result..  The commit message is converted on
output, to `i18n.commitencoding' (because `i18n.logoutputencoding' isn't
set) when possible, in a consistent manner across 'git log', 'rebase -i',
and 'git commit --amend'.  And my earlier conclusion about commit messages
being normalized on input to "utf-8" was bogus.
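
The "converted on output when possible" behaviour can be modelled with a short Python sketch. It is a simplified model, not Git's implementation, and it rests on three assumptions: each recreated commit carries an 'encoding' header equal to the `i18n.commitencoding' value in effect when it was made, a commit without a header is treated as UTF-8, and a failed conversion leaves the raw bytes untouched.

```python
def message_for_output(raw, header_encoding, output_encoding):
    """Re-encode a stored message for output, keeping raw bytes on failure."""
    source = header_encoding or "utf-8"  # assumption: no header means UTF-8
    if source.lower() == output_encoding.lower():
        return raw
    try:
        return raw.decode(source).encode(output_encoding)
    except (UnicodeDecodeError, UnicodeEncodeError):
        return raw


TEXT = "テスト"
UTF8, CP932 = TEXT.encode("utf-8"), TEXT.encode("cp932")

# Commit number -> (assumed 'encoding' header, bytes written via fenc).
COMMITS = {
    1: (None, UTF8), 2: ("cp932", UTF8), 3: ("ascii", UTF8),
    4: (None, CP932), 5: ("cp932", CP932), 6: ("ascii", CP932),
}


def utf8_commits(output_encoding):
    """List the commits whose output bytes are the UTF-8 form of TEXT."""
    return [
        number
        for number, (header, raw) in COMMITS.items()
        if message_for_output(raw, header, output_encoding) == UTF8
    ]


for output_encoding in ("utf-8", "cp932", "ascii"):
    print(output_encoding, utf8_commits(output_encoding))
```

Under these assumptions the model yields the same utf-8/cp932 split as the 'commit --amend' rows above for the unset, "cp932" and "ascii" settings, which fits the reading that conversion happens only on output.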

Regards,

--
Atsushi Nakagawa
<at...@chejz.com>
Changes are made when there is inconvenience.

  fenc_utf-8_test_repos_fixed.bundle
2K Download

 