Atsushi Nakagawa <at...@chejz.com> wrote:
> > I think that currently the only way to reliably get UTF-8 in commit
> > messages when using Vim is to set core.editor (or the GIT_EDITOR
> > environment variable) to
> >   vim.exe -c "set enc=utf-8"
> > or
> >   vim.exe -c "set fenc=utf-8"
> > (I think the former is slightly more logical these days).
> I think the former would also require a "set tenc=ANSI" (where ANSI is > the specific ANSI codepage (ACP), as per Microsoft's definition, of > the system).
I think this is incorrect.
Note that one of the infamous properties of the Windows console and how "console" programs are supposed to interact with it is that (at least by default) the Windows console uses two different encodings: one for the output and another one for everything else--command-line arguments and file I/O. That is, any output function from stdio.h used in your typical console program is supposed to output text encoded in "OEM" encoding when it writes to the console stream, and use "ANSI" encoding when it writes files and interprets command-line arguments.
You can try creating a text file and printing its contents in the console window by running `type FILENAME`. You'll notice that the output will be readable only if the text is composed in the relevant "OEM" encoding *unless* the console code page has been changed (using the chcp command for instance).
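To make the OEM/ANSI split concrete, here is a minimal Python sketch of what `type FILENAME` runs into. It assumes a Russian Windows setup where the "ANSI" code page is cp1251 and the "OEM" code page is cp866; the sample word and the variable names are made up for illustration.

```python
# Sketch: the same text has different bytes in the "ANSI" and "OEM" code pages.
# Assumption: Russian Windows, ANSI = cp1251, OEM = cp866.
text = "Привет"  # hypothetical sample word

ansi_bytes = text.encode("cp1251")  # what a typical Windows editor writes to a file
oem_bytes = text.encode("cp866")    # what the console expects by default

# `type FILENAME` hands the file bytes to the console unchanged, and the
# console interprets them as OEM, so an ANSI-encoded file shows up garbled.
shown_from_ansi_file = ansi_bytes.decode("cp866")
shown_from_oem_file = oem_bytes.decode("cp866")

# ascii() keeps the output printable whatever the terminal encoding is.
print(ascii(shown_from_ansi_file))
print(ascii(shown_from_oem_file))
```

Only the file written in the OEM code page survives the round trip; changing the console code page (for instance with chcp) is what would make the ANSI file readable.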
As I understand, Git for Windows uses WriteConsoleW() to bypass normal stdio mechanics. This is done to output Unicode and let the console subsystem take care of the rest. But we do not control Vim, and for it to use the "OEM" tenc appears to be a sensible idea.
As to what Vim does, if I run
    vim -c "set enc=utf-8"
for a stock Vim 7.3 in a cmd.exe window on my Russian Windows XP, in response to
    :set enc? tenc? fenc?
I get
    encoding=utf-8
    termencoding=cp866
    fileencoding=
which lists the expected terminal encoding (the default "OEM" code page for my flavor of Windows). For the Vim packaged with Git (when I run it from Git's bash) the result is the same.
> The discussion here looks at the encoding of messages in new > commits, but we should also look at what happens when vim reloads > previous commit messages (e.g. for `rebase -i' and `commit --amend').
> In order to correctly handle those situations as well, the options > will probably become:
>   vim --cmd "set enc=utf-8 tenc=ANSI"
> or
>   vim --cmd "set fencs=utf-8"
>
> (i.e. `--cmd' rather than `-c' and `fencs' instead of `fenc')
[...]
> So it's a toss-up between these defaults:
>   set enc=utf-8 fencs= tenc=ANSI
> or
>   set enc=ANSI fencs=utf-8 tenc=
> Maybe the former is conceptually better, but practically I don't think > the latter is less so.
> The former presents a challenge in that I don't know how to set "ANSI" > (the system dependent value, which is different from what vim means by > "ansi", and is "cp932" here). `enc' starts off with the ANSI setting > and I think it's done internally by vim.
I think we should just not touch tenc at all as Vim seems to do the right thing itself. See above.
> Another consideration is that the former doesn't work properly here (I > can't input native characters with that setting because vim goes out > of INSERT mode when I try). A short discussion at [1] says that vim > is flaky in unicode mode so maybe it's related to that.
[...]
Works here for Cyrillic. From your name and Windows codepage I assume you're inputting Japanese. I've heard that the input methods (as well as the codepages) used for it are way more complicated than those for Roman/Cyrillic alphabets. Maybe this is the issue here.
> Anyways, I propose the latter approach and will put together a patch > as soon as I've overcome some technical and time constraint hurdles. > Maybe I'll post an RFC of what I'm thinking of doing before that > because I'm on an network where I can't checkout external > repositories.
> > Messing with defaults is dangerous because on Windows you usually > > expect test editors to write out files in ANSI code page.
> It's could be best if it can be limited to COMMIT_EDITMSG, but since > that's a technical challenge in itself, I'm thinking it might be okay > if the defaults were changed via "C:\Program Files\Git\share\vim > \vimrc" because that's under "Git for Windows"s install path.
I wonder how this will play out for users with specific requirements, such as a particular i18n.commitencoding setting. I don't know if there are any, though.
On Wed, 7 Dec 2011, Atsushi Nakagawa wrote: > [... a long mail discussing vim encodings and then sent a patch to the > issue tracker that did not apply...]
I tried to clean up the patch and pushed the result as an/vim-utf-8. Since that discussion was a bit too long for my attention span I am unsure whether this should be merged as-is.
Johannes Schindelin <Johannes.Schinde...@gmx.de> wrote: > On Wed, 7 Dec 2011, Atsushi Nakagawa wrote: > > [... a long mail discussing vim encodings and then sent a patch to the > > issue tracker that did not apply...]
> I tried to clean up the patch and pushed the result as an/vim-utf-8.
Thanks Johannes! I was pondering whether I should try to make that Google Code editor patch git-compatible by hand or just wait until the weekend when I'm at home. I've looked over your clean-up changes (wrapping and the first word).
Regards,
-- Atsushi Nakagawa <at...@chejz.com> Changes are made when there is inconvenience.
Konstantin Khomoutov <flatw...@users.sourceforge.net> wrote: > On Wed, 07 Dec 2011 11:42:43 +0900 > Atsushi Nakagawa <at...@chejz.com> wrote:
> > > I think that currently the only way to reliably get UTF-8 in commit
> > > messages when using Vim is to set core.editor (or the GIT_EDITOR
> > > environment variable) to
> > >   vim.exe -c "set enc=utf-8"
> > > or
> > >   vim.exe -c "set fenc=utf-8"
> > > (I think the former is slightly more logical these days).
> > I think the former would also require a "set tenc=ANSI" (where ANSI is > > the specific ANSI codepage (ACP), as per Microsoft's definition, of > > the system).
> I think this is incorrect.
I've summarized the details of this reasoning to make sure I've understood correctly:
> [ The code page for `tenc' should be "OEM" and not "ANSI". > (Background information and examples followed.) ]
Thanks for that! I didn't realize there was a distinction between the two and used "ANSI" to mean both. (A quick look at Wikipedia's "Code page" article suggests that (only?) Japanese, Chinese and Korean systems share the same code page for both, so the distinction seems to hold elsewhere.)
Nevertheless, I only used the word "ANSI" because that's what I assumed it was when I saw vim set `enc' to "cp932". My point about requiring `tenc' to be set will follow in response to below...
> [ `tenc' does not need to be set because vim already does the right > thing and sets it to the OEM codepage like so: ]
> ..., if I run
>   vim -c "set enc=utf-8"
> for a stock Vim 7.3 in a cmd.exe windows on my Russian Windows XP,
> in response to
>   :set enc? tenc? fenc?
> I get
>   encoding=utf-8
>   termencoding=cp866
>   fileencoding=
This is interesting indeed. Because I get in response to the same sequence:
    encoding=utf-8
    termencoding=
    fileencoding=

and if run without the `-c' argument:
    encoding=cp932
    termencoding=
    fileencoding=
So yes, vim does the right thing in controlling `enc' and `tenc' so that (if we don't change anything) `tenc' and `fenc' will effectively and correctly be OEM and ANSI respectively.
However, the moment we change `enc', we get a case of varying mileage, and on a Japanese system `tenc' is required to be explicitly "set back" to OEM (cp932).
I think this is all the more reason to choose your latter approach of only changing `fenc'. In the patch that Johannes kindly moved to "an/vim-utf-8", this is attempted by setting just `fencs'.
> I think we should just not touch tenc at all as Vim seems to do the > right thing itself. See above.
I agree. The question is then whether we avoid touching `enc', because vim will set `tenc' to nothing on certain systems, making it depend on `enc'.
> > Another consideration is that the former doesn't work properly here (I > > can't input native characters with that setting because vim goes out > > of INSERT mode when I try). A short discussion at [1] says that vim > > is flaky in unicode mode so maybe it's related to that. > [...]
> Works here for Cyrillic. From your name and Windows codepage I assume > you're inputting Japanese. I heard the input methods (as well as > codepages) used for it are way more complicated than for roman/cyrillic > alphabets. May be this is the issue here.
The use of IME could be related, but the same scenario works if `enc' remains "cp932". I dunno, maybe something was overlooked in vim's input logic.
> > Anyways, I propose the latter approach and will put together a patch > > as soon as [...]
> > > Messing with defaults is dangerous because on Windows you usually > > > expect test editors to write out files in ANSI code page.
> > It's could be best if it can be limited to COMMIT_EDITMSG, but since > > that's a technical challenge in itself, I'm thinking it might be okay > > if the defaults were changed via "C:\Program Files\Git\share\vim > > \vimrc" because that's under "Git for Windows"s install path.
On a side note, I did end up limiting it to COMMIT_EDITMSG.
> I wonder how this will play for the user with specific requirements like > using certain i18n.commitencoding. I don't know if there are any, > though.
I did some quick tests...
    $ git init
    $ git config i18n.commitencoding cp932
    $ git commit --allow-empty
    (input a japanese message with fenc=utf-8)
    $ git log
    (I see the message correctly)
    $ git commit --amend --allow-empty
    (I see the message correctly and fenc is utf-8)

    $ git config i18n.commitencoding ascii
    $ git commit --allow-empty
    (input a japanese message with fenc=utf-8)
    $ git log
    (I see the message correctly)
    $ git commit --amend --allow-empty
    (I see the message correctly and fenc is utf-8)
At this point, I don't know what's going on.
Regardless, I think there are three possibilities:
  1, COMMIT_EDITMSG, the interface between the commit log message and EDITOR, is utf-8 regardless of i18n.commitencoding. In which case everything's good.
  2, COMMIT_EDITMSG's encoding mirrors that of the commit log message and hence changes with i18n.commitencoding. Bad, but do we cater for this and, on account of it, leave vim editing of native characters broken for everyone else?
  3, i18n.commitencoding doesn't work at all. In which case the patch itself is also ok.
Regards,
-- Atsushi Nakagawa <at...@chejz.com> Changes are made when there is inconvenience.
> I did some quick tests...
>   $ git init
>   $ git config i18n.commitencoding cp932
>   $ git commit --allow-empty
>   (input a japanese message with fenc=utf-8)
>   $ git log
>   (I see the message correctly)
>   $ git commit --amend --allow-empty
>   (I see the message correctly and fenc is utf-8)
>
>   $ git config i18n.commitencoding ascii
>   $ git commit --allow-empty
>   (input a japanese message with fenc=utf-8)
>   $ git log
>   (I see the message correctly)
>   $ git commit --amend --allow-empty
>   (I see the message correctly and fenc is utf-8)
> At this point, I don't know what's going on.
I don't think that these tests are representative. git-commit copies the bytes between COMMIT_EDITMSG and the commit object without changing them. Hence, whatever bytes vim wrote for the first commit, it will see the same bytes for the second commit.
The purpose of i18n.commitencoding is to tell interested readers how to interpret the byte sequence in the commit object, and not to ask git to convert the bytes in COMMIT_EDITMSG to that encoding.
You have set the file encoding in vim to UTF-8. But it is likely that the byte sequence that results from Japanese text is outside cp932; it definitely is outside ASCII. (But you were lucky; git-commit didn't care - it is not one of those "interested readers".)
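The point that i18n.commitencoding only labels the bytes can be illustrated with a small Python sketch. This is a toy model, not Git's code: the sample message, the `stored` variable and the `read_as` helper are made up for illustration.

```python
# Sketch: i18n.commitencoding only labels the bytes; it does not convert them.
message = "テスト"  # hypothetical Japanese commit message

utf8_bytes = message.encode("utf-8")   # what vim writes with fenc=utf-8
cp932_bytes = message.encode("cp932")  # what vim would write with fenc=cp932

# git-commit copies the bytes from COMMIT_EDITMSG unchanged, so the commit
# object holds UTF-8 bytes even when the label says "cp932" or "ascii".
stored = utf8_bytes


def read_as(raw: bytes, label: str) -> str:
    """Interpret raw bytes the way a reader trusting the label would."""
    return raw.decode(label, errors="replace")


print(ascii(read_as(stored, "utf-8")))  # matching label: text is intact
print(ascii(read_as(stored, "cp932")))  # mismatched label: garbled
print(ascii(read_as(stored, "ascii")))  # mismatched label: only replacement characters
```

Since `commit --amend` hands the same bytes back to vim, and vim reads them as UTF-8 again, the round trip looks fine regardless of the label.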
(But perhaps you know all that already - I haven't followed this sub-thread too closely.)
Johannes Sixt <j...@kdbg.org> wrote: > Am 08.12.2011 03:28, schrieb Atsushi Nakagawa: > > [ lots of commands ] > > ..., I don't know what's going on.
> I don't think that these tests are representative. git-commit copies the > bytes between COMMIT_EDITMSG and the commit object without changing > them. Hence, whatever bytes vim wrote for the first commit, it will see > the same bytes for the second commit.
> [clarifications on workings of i18n.commitencoding]
Thanks for the heads up.
Okay, I've redone the tests to get a better understanding of what's going on and the results are interesting. The conclusions I draw are that 1, merging the patch probably won't have major consequences in this regard, and 2, I should include "git-rebase-todo" in the filename pattern as well.
Anyways, here's what I did:
I made 6 commits with the following settings:
    1: i18n.commitencoding=      ; fenc=utf-8
    2: i18n.commitencoding=cp932 ; fenc=utf-8
    3: i18n.commitencoding=ascii ; fenc=utf-8
    4: i18n.commitencoding=      ; fenc=cp932
    5: i18n.commitencoding=cp932 ; fenc=cp932
    6: i18n.commitencoding=ascii ; fenc=cp932

I set `i18n.commitencoding' with 'git config' prior to committing and I used ":set fenc=x" before saving to set the output encoding. I included a bit of Japanese text in the commit message for testing. (Btw, "cp932" (Shift JIS + Microsoft extensions) is pretty much the de facto standard charset for Japanese.)
I also put in an empty place-holder commit before "1" and tagged it "base" for easier rebasing.
Finally, I tweaked `i18n.commitencoding' before each test because surprisingly setting this after-the-fact seems to make a difference.
The results are: ("bad" means the text is garbled)
    $ git config i18n.commitencoding cp932
    $ (same sequence as above for 1 to 6)
    (1 to 6 -> "COMMIT_EDITMSG" came out "cp932")

    $ git config i18n.commitencoding ascii
    $ (same sequence as above for 1 to 6)
    (1, 2, 3, 5 -> "COMMIT_EDITMSG" came out "utf-8")
    (4, 6 -> "COMMIT_EDITMSG" came out "cp932")
So "COMMIT_EDITMSG" is normalized to whatever `i18n.commitencoding' is, at least on output (when amending). Thankfully, `vim' seems to work regardless of "set fencs=utf-8" because the "i18n.commitencoding=cp932" test for 'commit --amend' resulted in `vim' choosing nothing for `fenc' (defaulting to ANSI because `enc' is "cp932"). *
The result of "5" in the "ascii" version of 'git log' suggests the raw bytes in "COMMIT_EDITMSG" may be normalized to an internal representation of "utf-8" on commit as well..
* I thought I might have had to fix the patch to "set fencs=utf-8,ANSI", which would have been a pain to do, but that doesn't seem to be the case. A quick look at the ":help fencs" confirms this: if none of the `fencs' encodings work, `vim' defaults to nothing, which is effectively ANSI in our case.
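The fallback described in ":help fencs" can be modelled roughly as follows. This is a simplified Python sketch of the documented behaviour, not Vim's implementation: it ignores BOM detection and other special entries, and `detect_fenc` is a made-up name.

```python
from typing import List, Tuple


def detect_fenc(raw: bytes, fencs: List[str], enc: str) -> Tuple[str, str]:
    """Return (fenc, text); fenc is "" when no candidate in fencs fits."""
    for candidate in fencs:
        try:
            return candidate, raw.decode(candidate)
        except UnicodeDecodeError:
            continue  # conversion error: try the next candidate
    # Nothing matched: fenc stays empty and the file is read as `enc`.
    return "", raw.decode(enc, errors="replace")


sample = "テスト"  # hypothetical message text

# fencs=utf-8 with enc=cp932, as on a Japanese system running the patch.
from_utf8 = detect_fenc(sample.encode("utf-8"), ["utf-8"], "cp932")
from_cp932 = detect_fenc(sample.encode("cp932"), ["utf-8"], "cp932")

print(ascii(from_utf8))
print(ascii(from_cp932))
```

Under this model a cp932 file fails the UTF-8 check, `fenc` stays empty, and the text is still read correctly through `enc`, which is why adding ANSI to `fencs' looks unnecessary.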
Regards,
-- Atsushi Nakagawa <at...@chejz.com> Changes are made when there is inconvenience.
> Okay, I've redone the tests to get a better understanding of what's > going on and the results are interesting. The conclusions I draw is > that 1, merging the patch probably won't have major consequences in this > regard, and 2, I should include "git-rebase-todo" in the filename > pattern as well.
> Anyways, here's what I did:
> I made 6 commits with the following settings:
>   1: i18n.commitencoding=      ; fenc=utf-8
>   2: i18n.commitencoding=cp932 ; fenc=utf-8
>   3: i18n.commitencoding=ascii ; fenc=utf-8
>   4: i18n.commitencoding=      ; fenc=cp932
>   5: i18n.commitencoding=cp932 ; fenc=cp932
>   6: i18n.commitencoding=ascii ; fenc=cp932
None of the commits in the bundle that you included has an 'encoding' header. No wonder that the results are surprising (at least for me):
    tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
    parent 4a88c9a4b3f7ab384589e9780641bdadb7b09eb5
    author Atsushi Nakagawa <at...@chejz.com> 1323394941 +0900
    committer Johannes Sixt <j...@kdbg.org> 1323416504 +0100
    encoding cp932
i18n.commitencoding= ; fenc=cp932 ; e X g
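For anyone who wants to check for the 'encoding' header programmatically, here is a rough Python sketch that splits a raw commit object into headers and message. It is a simplified parser for illustration only: it ignores repeated and multi-line headers, the sample commit is hypothetical, and it assumes the encoding name is one Python also understands (Git passes the name to iconv, so the two will not always match).

```python
from typing import Dict, Tuple


def split_commit(raw: bytes) -> Tuple[Dict[str, str], bytes]:
    """Split a raw commit object into a header dict and the message bytes."""
    header_part, _, body = raw.partition(b"\n\n")
    headers: Dict[str, str] = {}
    for line in header_part.split(b"\n"):
        key, _, value = line.partition(b" ")
        headers[key.decode("ascii")] = value.decode("utf-8", errors="replace")
    return headers, body


def message_text(raw: bytes) -> str:
    """Decode the message using the 'encoding' header, defaulting to UTF-8."""
    headers, body = split_commit(raw)
    # A commit without an 'encoding' header is taken to be UTF-8.
    return body.decode(headers.get("encoding", "utf-8"), errors="replace")


# Hypothetical commit object with a cp932-encoded message.
sample_commit = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A U Thor <author@example.com> 1323394941 +0900\n"
    b"committer A U Thor <author@example.com> 1323394941 +0900\n"
    b"encoding cp932\n"
    b"\n" + "テスト\n".encode("cp932")
)

sample_headers, sample_body = split_commit(sample_commit)
print(sample_headers.get("encoding"))
print(ascii(message_text(sample_commit)))
```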
> Finally, I tweaked `i18n.commitencoding' before each test because > surprisingly setting this after-the-fact seems to make a difference.
Yes, it looks like, if i18n.logoutputencoding is not set, then i18n.commitencoding is used. Therefore, for the log tests, you should either set the former, or unset the latter.
Johannes Sixt <j...@kdbg.org> wrote: > Am 09.12.2011 04:37, schrieb Atsushi Nakagawa: > > Okay, I've redone the tests to get a better understanding of what's > > going on and the results are interesting. ...
> > Anyways, here's what I did:
> > I made 6 commits with the following settings:
> >   1: i18n.commitencoding=      ; fenc=utf-8
> >   2: i18n.commitencoding=cp932 ; fenc=utf-8
> >   3: i18n.commitencoding=ascii ; fenc=utf-8
> >   4: i18n.commitencoding=      ; fenc=cp932
> >   5: i18n.commitencoding=cp932 ; fenc=cp932
> >   6: i18n.commitencoding=ascii ; fenc=cp932
> None of the commits in the bundle that you included has an 'encoding' > header. No wonder that the results are surprising (at least for me):
Oops, my bad. I did a series of 'git commit --allow-empty -C's to insert the base commit and it must've wiped it then.
> git cat-file commit db0369c
Aha, so that's how you look at the header as well as the raw bytes..
The raw bytes for "5" also looked doubtful so I recreated each commit then redid each test:
    $ git config i18n.commitencoding cp932
    $ (same sequence as above for 1 to 6)
    (utf-8 files: 2, 3; cp932 files: 1, 4, 5, 6)

    $ git config i18n.commitencoding ascii
    $ (same sequence as above for 1 to 6)
    (utf-8 files: 1, 2, 3; cp932 files: 4, 5, 6)
So yeah, a more sensible result. The commit message is converted on output to `i18n.commitencoding' (because `i18n.logoutputencoding' isn't set) when possible, in a consistent manner across 'git log', 'rebase -i', and 'git commit --amend'. And my earlier conclusion about commit messages being normalized on input to "utf-8" was bogus.
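The output-side conversion can be modelled with a short Python sketch. This is a toy model of the behaviour observed in the tests, not Git's actual code: both function names are made up, and passing the bytes through unchanged on failure is my reading of "when possible".

```python
from typing import Optional


def output_encoding(logoutputencoding: Optional[str],
                    commitencoding: Optional[str]) -> str:
    """i18n.logoutputencoding falls back to i18n.commitencoding, then UTF-8."""
    return logoutputencoding or commitencoding or "utf-8"


def reencode_for_output(raw: bytes, header_encoding: Optional[str],
                        out_encoding: str) -> bytes:
    """Convert message bytes from the commit's declared encoding for output."""
    src = header_encoding or "utf-8"  # no 'encoding' header means UTF-8
    if src.lower() == out_encoding.lower():
        return raw  # same label on both sides: nothing to convert
    try:
        return raw.decode(src).encode(out_encoding)
    except (UnicodeDecodeError, UnicodeEncodeError):
        return raw  # conversion not possible: bytes pass through unchanged


msg = "テスト"  # hypothetical message text

# Commit 1 (no header, UTF-8 bytes) with i18n.commitencoding=cp932:
out1 = reencode_for_output(msg.encode("utf-8"), None,
                           output_encoding(None, "cp932"))
# Commit 5 (header cp932, cp932 bytes) with i18n.commitencoding=ascii:
out5 = reencode_for_output(msg.encode("cp932"), "cp932",
                           output_encoding(None, "ascii"))
```

In this model commit 1 comes out as cp932 and commit 5 stays cp932 because Japanese text cannot be expressed in ASCII, which agrees with the file lists above.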
Regards,
-- Atsushi Nakagawa <at...@chejz.com> Changes are made when there is inconvenience.