commit messages: BOM or not


Arioch

Aug 14, 2012, 6:00:21 AM
to msy...@googlegroups.com
The default commit encoding is UTF-8.

Unicode text can be bare or prefixed with a BOM.
http://unicode.org/faq/utf_bom.html

The IDE integration I am trying to improve uses the simplest implementation,
which writes a 3-byte BOM.

GUI tools such as the IDE itself or TortoiseGit display such messages transparently.
The same goes for gitk / git-gui.
I guess Git Cola and other GUI tools would do the same.

However, CLI git log shows an "unknown symbol" square before the message text.

It clearly outputs the UTF-8 without parsing it and just appends it to the stream it
pushes to stdout, so that the BOM is now in the middle of the stream.

http://imgur.com/7oDnM

If some application parsed plain "git log" output as is (though that is not
the intended way to integrate, of course), it would not be able to strip that
garbage off.

So either CLI Git is doing something incorrectly here,
or Git commit objects are meant to be bare, without a BOM,
or it is just a gray area with a lack of specification?



Erik Faye-Lund

Aug 14, 2012, 6:46:20 AM
to Arioch, msy...@googlegroups.com, Karsten Blees
On Tue, Aug 14, 2012 at 12:00 PM, Arioch <the_A...@nm.ru> wrote:
> Default commit encoding is UTF-8
>
> Unicode can be bare or BOM-prepended.
> http://unicode.org/faq/utf_bom.html
>
> Using most easy implementation, IDE integration i try to enhance, makes 3-bytes
> BOM
>
> GUI tools like that IDE itself or TortoiseGit show such messages transparently.
> Same for GiTk / Git-GUI
> I guess Git Cola and other GUI tools would do the same.
>
> However CLI git log shows "unknown symbol" square before message text.

Interesting. Do you have a script that reproduces it? I cannot find
the string "unknown symbol" in our sources, though. Are you sure that
warning is output by Git?

Karsten, do you think this is a bug? AFAIK, the UTF-8 BOM is designed
to be composed of purely non-visible characters, so should we really
complain (if it turns out we do, that is)?

> It clearly outputs UTF8 without parsing it and just appends it to the stream it
> pushes to stdout, so that BOM is no in the middle of the stream.
>
> http://imgur.com/7oDnM
>
> If some application would parse plain "git log" output as is - though it is not
> intended way for integration of course - it would not be able to strip that
> garbage off.
>
> That looks like either CLI Git is doing incorrectly here.
> Or Git commit objects are to be bare, not including BOM.
> Or it is just gray area with lack of specifications ?

Regardless of the above, I would say that including the BOM is a
mistake. This is because commits without an explicit legacy encoding
stored in the commit object are already assumed to be UTF-8 (see the
documentation of i18n.commitEncoding on
http://git-scm.com/docs/git-config), so there's no point in
potentially confusing other tools with the BOM.
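
For what it's worth, a tool that generates the message file itself could drop the BOM before calling git commit -F. A minimal sketch, assuming the message is already in a memory buffer; skip_utf8_bom is a made-up helper name, not anything that exists in Git:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The UTF-8 encoding of U+FEFF, i.e. the 3-byte "BOM". */
static const unsigned char UTF8_BOM[3] = { 0xEF, 0xBB, 0xBF };

/*
 * Hypothetical helper: return how many leading bytes of msg should be
 * skipped so that the commit message starts without a UTF-8 BOM.
 */
size_t skip_utf8_bom(const unsigned char *msg, size_t len)
{
	assert(msg != NULL || len == 0);
	if (len >= sizeof(UTF8_BOM) &&
	    memcmp(msg, UTF8_BOM, sizeof(UTF8_BOM)) == 0)
		return sizeof(UTF8_BOM);
	return 0;
}
```

The caller would then write msg + skip_utf8_bom(msg, len) to the temporary file instead of msg. Only a leading BOM is touched; anything later in the text is left alone.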

Arioch

Aug 14, 2012, 7:20:32 AM
to msy...@googlegroups.com
> > However CLI git log shows "unknown symbol" square before message text.
>
> Interesting. Do you have a script that reproduces it?

No script.
Just git commit -F :-)

http://www30.zippyshare.com/v/40905982/file.html

> I cannot find
> the string "unknown symbol" in our sources, though. Are you sure that
> warning is output by Git?

Did you see the screenshot? That is not a warning, but a placeholder the OS
renders for unavailable characters.
And since CLI git consists of many different tools piped together, like less and
such, I can't even imagine what route that char took.

I can only see a little piece of garbage in the console output.

> > It clearly outputs UTF8 without parsing it and just appends it to the stream it
> > pushes to stdout, so that BOM is no in the middle of the stream.
> >
> > http://imgur.com/7oDnM


^^^^^^


> mistake. This is because commits without an explicit legacy-encoding
> stored in the commit object is already assumed to be UTF-8 (see the
> documentation of i18n.commitEncoding on
> http://git-scm.com/docs/git-config), so there's no point in
> potentially confusing other tools with the BOM.

Still, the BOM is allowed and not advised against, and many libraries write it to mark
the presence of UTF-8 or to keep code generic across the different UTFs.

I'll look into whether it is easy to suppress the BOM there.

Overall I'd like to keep the file open while Git is running and only close it
afterwards (with Windows auto-deleting it). If that were possible, I'd have
to do the encoding transformation before saving the data and would have more control over
it. As of now I just use standard file I/O, specifying that the file is to be UTF-8.
The rest is done by the language RTL, and I am not sure it allows suppressing the BOM.

PS. I feel it is a mistake that git config --get -z i18n.commitencoding returns an empty
string for me. Why would it not return the default one?
That would make its output easier to parse and would leave Git free to change
its defaults later without breaking applications. :-/
Instead of this being centralized in Git once and for all, now every application has
to re-implement it.
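
The re-implementation every application ends up carrying is roughly the following. This is only a sketch: effective_commit_encoding is a hypothetical name, and the "UTF-8" fallback is taken from the git-config documentation rather than from anything Git reports at run time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: `configured` is whatever the caller read back from
 * `git config --get i18n.commitencoding` (NULL or "" when nothing came back).
 * Falls back to UTF-8, the documented default.
 */
const char *effective_commit_encoding(const char *configured)
{
	if (configured == NULL || configured[0] == '\0')
		return "UTF-8";
	return configured;
}
```

The weak point is exactly the one described above: the default is hard-coded in the application, so it would silently go stale if Git ever changed it.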

Erik Faye-Lund

Aug 14, 2012, 7:44:07 AM
to Arioch, msy...@googlegroups.com, Karsten Blees
Please do not cull the CC-list. Hit "Reply to All" in your e-mail
program instead of "Reply".

On Tue, Aug 14, 2012 at 1:20 PM, Arioch <the_A...@nm.ru> wrote:
>> I cannot find
>> the string "unknown symbol" in our sources, though. Are you sure that
>> warning is output by Git?
>
> Did you saw the screenshot. That is not the warning, but a placeholder OS
> renders for unavailable characters.
> And since CLI git consists of many different tools pipelined, like less and
> such, i can't even imagine what was the route of that char.
>
> I can only see little piece of garbage in console output.
>
>> > It clearly outputs UTF8 without parsing it and just appends it to the stream it
>> > pushes to stdout, so that BOM is no in the middle of the stream.
>> >
>> > http://imgur.com/7oDnM
>
>
> ^^^^^^
>

Aha, thanks for pointing that out, I feel stupid :)

This might be a glitch in the UTF-8 output. But Karsten is better
qualified to answer this.

Actually, thinking of it, I don't think it is. The behavior is
documented: "Character encoding the commit messages are stored in; git
itself does not care per se <...>". So, Git does not care about the
encoding, it only records what it was reported as. So it cannot be
reasonably assumed that git strips it, and since a BOM should always
come at the beginning of some output, a present BOM will break the
output. We don't output the commit message at the beginning, in fact,
the code that is UTF-8 aware has no way of knowing where to search for
the BOM.

So I think the answer here is a clear "don't do that".

>> mistake. This is because commits without an explicit legacy-encoding
>> stored in the commit object is already assumed to be UTF-8 (see the
>> documentation of i18n.commitEncoding on
>> http://git-scm.com/docs/git-config), so there's no point in
>> potentially confusing other tools with the BOM.
>
> Still BOM is allowed and not suggested against, and many libs put it to mark
> UTF8 presence or for keeping code generic for different UTFs.
>
> I'd look if it will be easy to suppress BOM there.
>
> Overall i'd liked to keep file open while Git is running and only close it
> after (with Windows auto-deleting it). Whether that would be possible, i'd have
> to do encoding transformation before saving data and would have more control of
> it. As of now i just use standard file i/o specifying that file is to be UTF-8.
> The rest is made by language RTL and i am not sure it allwos BOM supressing.
>
> PS. i feel a mistake that git config --get -z i18n.commitencoding return empty
> string to me. Why would not it return default one ?
> that would make it's output easier to parse and would make Git free to change
> it's defaults later without breaking applications. :-/
> Instead of making it centralized in Git once for all, now every application has
> to re-implement it.
>

These are all questions you would have to ask upstream. They are in no
way specific to Git for Windows.

If we modify Git for Windows to act differently than upstream Git WRT
BOMs, then we're opening up a horrible can of worms where the same
commits generate different SHA-1s on Windows and Linux.

Arioch

Aug 15, 2012, 5:09:26 AM
to msy...@googlegroups.com
> Please do not cull the CC-list. Hit "Reply to All" in your e-mail
> program instead of "Reply".

I have no e-mail access here.
I use the web interface, and it does send mail to the list.
See here: http://thread.gmane.org/gmane.comp.version-control.msysgit/16557/focus=16564
All the messages are publicly in place.

> >> > http://imgur.com/7oDnM

> This might be a glitch in the UTF-8 output. But Karsten is better
> qualified to answer this.


> Actually, thinking of it, I don't think it is. The behavior is
> documented: "Character encoding the commit messages are stored in; git
> itself does not care per se <...>". So, Git does not care about the
> encoding, it only records what it was reported as.

Internally, yes.
But when Git interacts with external programs such as the console, it tries to recode
messages to the appropriate charset.

The screenshot itself shows that Git, when interacting with less, transcoded
the UTF-8 comment into the ibm866 charset (Russian letters for DOS, OS/2 and
text-mode Windows).

If it is already transcoding, then Unicode support is presumed.
And Unicode may include a BOM according to the standard.

If Git supports only part of the standard (the BOM-less part), then that should be documented.

And yes, I think this is a Windows-specific problem: on modern Linux one just sets
user accounts to UTF-8, console included, which the Windows console cannot do.

I would not be surprised if on Linux the BOM were stripped from the commit text
by one of the subsystems.

> So I think the answer here is a clear "don't do that".

Yes, I'll try to suppress it.
It would probably be easier for me to adapt to this bug in the Git suite than to consider
all the programs and libraries it consists of and debug their interactions :-)

I wonder how Git would treat other Unicode formats, with or without a BOM :-)




Arioch

Aug 15, 2012, 5:44:34 AM
to msy...@googlegroups.com
E>> Interesting. Do you have a script that reproduces it? I cannot find
E>> the string "unknown symbol" in our sources, though. Are you sure that
E>> warning is output by Git?
E>>
E>> Karsten, do you think this is a bug?

K> Yes, I think it is a bug in the Unicode specification to allow BOM
K> in UTF-8 :-)

K> Byte order in UTF-8 is always the same on all platforms, a Byte Order
K> Mark (BOM) has no purpose here.

It serves the purpose of clearly marking a UTF-8 stream.
While other heuristics do exist, the presence of a BOM is the most accurate.
Some libraries even call it a preamble rather than a BOM.

K> Git tries hard not to modify content, including the commit message.
K> This is well documented in the git-commit man page ("The commit log
K> messages are uninterpreted sequences of non-NUL bytes."). If you
K> tell git to store a BOM character in the commit message, git will do that.

Makes sense. In storage it is to be kept "as is".
That, however, does not mean the BOM has to be preserved when reading from
storage and merging several text streams into one.

K> AFAIK, the UTF-8 BOM is designed
K> to be composted of purely non-visible characters, so should
K> we really complain (if it turns out we do, that is)?

I wonder how UTF-16 would work there :-)

The trouble is that the Windows console is not Unicode-aware. So to output
something on it, the text should be transcoded in advance.
$FEFF has no counterpart in non-Unicode codepages, so it is probably
replaced with a kind of joker.

K> According to the Unicode spec, BOM in the middle of text should be
K> treated as zero width non-breaking space. We could filter BOM
K> characters in our winansi.c console driver, to fix the console
K> --no-pager case. I don't know how less, gitk, git-gui etc. handle BOMs.

As I wrote above, gitk works fine. git-gui simply has no means to show the log; it
calls gitk for that.
less becomes a problem, though.

Where is the part that transcodes the message to CP_OEMCP?
If it is in winansi.c, and it can tell whether the input stream is Unicode or not,
and given that "zero width" is what the Unicode standard asks for, I bet that
filtering BOMs out when transcoding to non-Unicode charsets would be
the best way.

K> IMO the best solution is not to use BOMs in commit messages.

Well, I'll probably hack the library and suppress it.
However, it is not disallowed, so it is expected to be handled :-)

K> Hth,
K> Karsten

Erik Faye-Lund

Aug 15, 2012, 5:49:41 AM
to Arioch, msysGit, Karsten Blees
On Wed, Aug 15, 2012 at 11:09 AM, Arioch <the_A...@nm.ru> wrote:
>> Please do not cull the CC-list. Hit "Reply to All" in your e-mail
>> program instead of "Reply".
>
> i cannot have e-mail access here.
> I use www interface and it does send mail to the list.
> See there: http://thread.gmane.org/gmane.comp.version-control.msysgit/16557/
> focus=16564
> All the messages are in place publicly.

If your e-mail client is broken, please make up for it by adding back
the missing CCs when you reply, then. By preserving the CC-list, not
only do the interested parties get an e-mail in their inbox, we can also
include people who do not subscribe to the mailing list without them
having to find the discussion by accident.

>> Actually, thinking of it, I don't think it is. The behavior is
>> documented: "Character encoding the commit messages are stored in; git
>> itself does not care per se <...>". So, Git does not care about the
>> encoding, it only records what it was reported as.
>
> Internally - yes.
> But when Git interacts with external programs like console, it tries to recode
> messages to appropriate charset.
>
> The very screenshot shows that Git, when interacting with LESS, did transcoding
> of UTF-8 comment into ibm866 charset (Russian letters for DOS,OS/2 and textual
> Windows)
>
> If it is already transcoding, then Unicode support is presumed.
> And Unicode may ionclude BOM by standard.
>
> If Git only partially supports standard, BOM-less, then it should be documented.
>
> And yes, i think that is windows-specific trouble, for modern Linux just settle
> user accounts to UTF-8 including console, which Windows console can not do.
>
> I would not be surprised in on Linux that commit text would be stripped of BOM
> by some of subsystems.
>

Some experimentation on some of my systems shows that:
- On Linux: non-breaking zero-width space DOES NOT get printed when
you write it to a terminal (I would not be surprised if this varied
with different terminals).
- On Windows: non-breaking zero-width space DOES get printed when you
write it to a console
- On both: less prints it as <U+FEFF> when it encounters this.

So, since the Windows console isn't nice enough to strip it for us, perhaps
we should strip it ourselves? The following patch seems to fix it for
me. Karsten, what do you think?

---8<---
diff --git a/compat/winansi.c b/compat/winansi.c
index 040ef5a..0c8f40f 100644
--- a/compat/winansi.c
+++ b/compat/winansi.c
@@ -112,12 +112,20 @@ static int is_console(int fd)
 
 static void write_console(unsigned char *str, size_t len)
 {
+	int i, j = 0;
 	/* only called from console_thread, so a static buffer will do */
 	static wchar_t wbuf[2 * BUFFER_SIZE + 1];
 
 	/* convert utf-8 to utf-16 */
 	int wlen = xutftowcsn(wbuf, (char*) str, ARRAY_SIZE(wbuf), len);
 
+	/* drop U+FEFF; separate read/write indices also handle repeated BOMs */
+	for (i = 0; i < wlen; ++i)
+		if (wbuf[i] != 0xFEFF)
+			wbuf[j++] = wbuf[i];
+	if (wlen > 0)
+		wlen = j;
+
 	/* write directly to console */
 	WriteConsoleW(console, wbuf, wlen, NULL, NULL);

---8<---

It feels a bit hacky, though. Perhaps there is some better way of
finding out that this is a code-point that should be rendered as
invisible? Are there other such code-points?
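
One way to keep the filter from being tied to a single magic number would be to put the decision behind a small predicate. The following is a standalone sketch, not code from winansi.c: is_zero_width and drop_zero_width are made-up names, and the list of code points is a deliberately short, illustrative assumption (it is certainly not exhaustive, and whether stripping anything beyond U+FEFF is wise is an open question).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative, deliberately short list of zero-width code points.
 * It is an assumption of this sketch, not something taken from Git.
 */
static int is_zero_width(wchar_t c)
{
	return c == 0xFEFF	/* ZERO WIDTH NO-BREAK SPACE (BOM) */
	    || c == 0x200B	/* ZERO WIDTH SPACE */
	    || c == 0x2060;	/* WORD JOINER */
}

/*
 * Compact buf in place, dropping zero-width code points, and return the
 * new length. Separate read and write indices mean that runs of
 * consecutive matches are handled as well.
 */
size_t drop_zero_width(wchar_t *buf, size_t len)
{
	size_t in, out = 0;

	assert(buf != NULL || len == 0);
	for (in = 0; in < len; ++in)
		if (!is_zero_width(buf[in]))
			buf[out++] = buf[in];
	return out;
}
```

With this shape, adding or removing a code point only touches the predicate, and the compaction loop stays the same.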

>> So I think the answer here is a clear "don't do that".
>
> Yes, i'd try to suppress it.
> It probably would be easier for me to adapt to Git suit bug, than to consider
> all the programs and libs it contains of and debug their interactions :-)
>
> I wonder how Git would treat another Unicode formats with or without BOM :-)

Git does not support any Unicode encoding other than UTF-8 at all. And it
never will.

Erik Faye-Lund

Aug 15, 2012, 5:55:32 AM
to Arioch, msysGit, Karsten Blees
So, just to make it clear: the issue on Windows is not about the UTF-8
BOM. It's about non-breaking zero-width space (and possibly other
code-points that shouldn't be rendered).

Git does not consider the BOM. It simply preserves exactly what you
entered. Perhaps this is good enough for you, I don't know. But
knowingly passing in a BOM opens up Pandora's box with regard to
tool support.

So, I think the patch might be an improvement, but I still stand by my
"don't do that" comment. Don't pass BOMs to UNIX tools, including Git.

Arioch

Aug 15, 2012, 6:31:45 AM
to msy...@googlegroups.com
> If your e-mail client is broken,

I don't have one.

> please make up for it by adding back
> the missing CCs when you reply, then.

So there is nowhere to add them.

I can't say whether the WWW gateway follows some policy or not,
but in any case I am not in control of it.

> Some experimentation on some of my systems show that:
> - On Linux: non-breaking zero-width space DOES NOT get printed when
> you write it to a terminal (I would not be surprised if this varied
> with different terminals).

And is the terminal locale Unicode?
What if you set up the user account with a non-Unicode locale?

> - On Windows: non-breaking zero-width space DOES get printed when you
> write it to a console

Well, are you sure you really write it all the way through to the Unicode-aware
Win32 API WriteConsoleW?
Or do you only think so, while the text is actually transcoded to a non-Unicode
charset along the way?

http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756.aspx

The UTF-8 code page id is 65001.

I open a console, issue "mode con cp select-65001" and "chcp 65001", and
git log still shows that substitute...


> - On both: less prints it as <U+FEFF> when it encounters this.

Does that mean less uses UCS-2 or UTF-16 rather than UTF-8?
I thought less was an 8-bit-per-char program.




Erik Faye-Lund

Aug 15, 2012, 6:51:23 AM
to Arioch, msy...@googlegroups.com, Karsten Blees
On Wed, Aug 15, 2012 at 12:31 PM, Arioch <the_A...@nm.ru> wrote:
>> If your e-mail client is broken,
>
> i don't have such.
>
>> please make up for it by adding back
>> the missing CCs when you reply, then.
>
> thus nowhere to add.
>
> i can't say if WWW gateway does follow some policy or not.
> but anyway i am not in control of it.
>

Then you SHOULD use an e-mail client, and get in control of it. This
is basic netiquette, really. Having to restore the CC-list all the
time is tiresome and annoying, and makes me less inclined to bother to
fix your problems.

>> Some experimentation on some of my systems show that:
>> - On Linux: non-breaking zero-width space DOES NOT get printed when
>> you write it to a terminal (I would not be surprised if this varied
>> with different terminals).
>
> and terminal locale is unicode ?
> what if you set user account with non-unicode locale ?
>

There is no such thing as a Unicode locale on Windows. There are
Unicode code-pages, but they aren't the same thing.

>> - On Windows: non-breaking zero-width space DOES get printed when you
>> write it to a console
>
> Well, are you sure you do write it, up to Unicode-aware Win32 API
> WriteConsoleW ?
> Or you think so, and really text is transcoded in the way to non-Unicode
> charset ?
>
> http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756.aspx
>
> UTF8 id is 65001.
>
> i open console, issue "mode con cp select-65001" and "chcp 65001" - and still
> git log shows that substitute...
>

On Windows, there are two ways of writing to the console:
WriteConsoleA, which uses the current locale (which cannot be set to
UTF-8, unfortunately), and WriteConsoleW, which always uses UTF-16.

If you look at compat/winansi.c (the relevant code is even quoted in
my patch), you'll see that we convert from UTF-8 to UTF-16 and write
to the console with WriteConsoleW. So yes, we do Unicode. And we do it
properly (modulo bugs and missing work-arounds for other tools' shortcomings).
This is all thanks to great work by Karsten :)

>> - On both: less prints it as <U+FEFF> when it encounters this.
>
> That means less uses UCS-2 or UTf-16 rather than UTF-8 ?
> I though less is a 8-bit per char program
>

Not quite, it means that my copies of less grok Unicode, both on
Windows and Linux. Unicode isn't 8 or 16 bit oriented, it's code-point
oriented. And less correctly detects that these 3 UTF-8 encoded
bytes (EF BB BF) are really just one code-point. It only looks like UTF-16 because
this particular code-point encodes in UTF-16 as a single code unit with
the same value as the code-point itself.
It's basically the same concept as with the first 128 code-points in
UTF-8.
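
To make the arithmetic concrete: U+FEFF is stored in UTF-8 as the three bytes EF BB BF, and decoding them gives back 0xFEFF, which is also its single UTF-16 code unit. A minimal sketch of that decoding step (decode_utf8_3byte is a made-up name, and this is not how less does it; it handles only the 3-byte form and does no further validation):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx).
 * Returns the code point, or 0xFFFFFFFF if the bytes do not have that
 * shape. Overlong forms and surrogates are not rejected; this is only
 * a sketch.
 */
uint32_t decode_utf8_3byte(const unsigned char *p)
{
	assert(p != NULL);
	if ((p[0] & 0xF0) != 0xE0 ||
	    (p[1] & 0xC0) != 0x80 ||
	    (p[2] & 0xC0) != 0x80)
		return 0xFFFFFFFFu;
	return ((uint32_t)(p[0] & 0x0F) << 12)
	     | ((uint32_t)(p[1] & 0x3F) << 6)
	     |  (uint32_t)(p[2] & 0x3F);
}
```

For EF BB BF the three payload groups are 0xF, 0x3B and 0x3F, which combine to 0xFEFF.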

karste...@dcon.de

Aug 15, 2012, 11:48:16 AM
to Arioch, msy...@googlegroups.com

msy...@googlegroups.com wrote on 15.08.2012 11:44:34:
> E>> Interesting. Do you have a script that reproduces it? I cannot find
> E>> the string "unknown symbol" in our sources, though. Are you sure that
> E>> warning is output by Git?
> E>>
> E>> Karsten, do you think this is a bug?
>
> K> Yes, I think it is a bug in the Unicode specification to allow BOM
> K> in UTF-8 :-)
>
> K> Byte order in UTF-8 is always the same on all platforms, a Byte Order
> K> Mark (BOM) has no purpose here.
>
> It serves the purpose to clearly mark UTF8 stream.


No it doesn't. In many encodings, 0xef 0xbb and 0xbf are just normal letters, so files starting with these bytes can actually be any encoding. It is generally impossible to reliably detect the encoding of plain text files from their (encoded) content alone. The best you can do is make a good guess, and you don't need the BOM for that.

POSIX systems typically don't support BOM, as it breaks ASCII compatibility and conflicts with the convention that shell scripts start with "#!".
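
The "#!" conflict is easy to see with a byte-level check. This is only an illustration of the convention (starts_with_shebang is a hypothetical helper, not code from any real loader): a test that looks at the first two bytes no longer matches once a BOM sits in front of them.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the file content begins with the two bytes "#!", else 0. */
int starts_with_shebang(const unsigned char *buf, size_t len)
{
	assert(buf != NULL || len == 0);
	return len >= 2 && buf[0] == '#' && buf[1] == '!';
}
```

A script saved as EF BB BF followed by "#!/bin/sh" fails this check even though every editor shows "#!" as its first two characters.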

Windows still doesn't support UTF-8 system encoding, so the BOM in text files is needed here to distinguish between UTF-8 and legacy system encoding. IMO this is a Windows specific workaround for a Windows specific problem.



> While other heuristics do exist, BOM presence is most accurate.
> Some libraries even call that preamble, rather than BOM
>

> K> Git tries hard not to modify content, including the commit message.
> K> This is well documented in the git-commit man page ("The commit log
> K> messages are uninterpreted sequences of non-NUL bytes."). If you
> K> tell git to store a BOM character in the commit message, git will do that.
>
> Makes sense. In the storage it is to be kept "as is"
> That however does not mean BOM is to be preserved when reading from
> storage, when merging few text streams into one.
>
> K> AFAIK, the UTF-8 BOM is designed
> K> to be composted of purely non-visible characters, so should
> K> we really complain (if it turns out we do, that is)?
>
> I wonder how UTF-16 would be worked there :-)
>



'git-commit -F <UTF-16 file>' fails with an error because UTF-16 text tends to be full of NUL bytes.
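
A tool that prepares message files can catch this before handing the file to Git. The sketch below is not Git's own check, just an assumption about what a caller might do; contains_nul is a made-up name:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the buffer contains at least one NUL byte, else 0. */
int contains_nul(const unsigned char *buf, size_t len)
{
	assert(buf != NULL || len == 0);
	return len > 0 && memchr(buf, 0, len) != NULL;
}
```

For example, "Hi" in UTF-16LE with a BOM is FF FE 48 00 69 00, so every ASCII letter drags a NUL byte along with it, while the UTF-8 form of the same text contains none.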



> The trouble is that windows console is not unicode-aware. So to output
> something on it, the terxtshould be transcoded in advance.
> $FFFE has not correspondence in non-unicode codepages. So it is probably
> replaced with kind of joker.
>
> K> According to the Unicode spec, BOM in the middle of text should be
> K> treated as zero width non-breaking space. We could filter BOM
> K> characters in our winansi.c console driver, to fix the console
> K> --no-pager case. I don't know how less, gitk, git-gui etc. handle BOMs.
>
> As i wote above, Gitk works fine. git-gui just has no means to show the log, it
> calls gitk for that.
> Less becomes a problem though.
>
> Where is the part, transcoding message to CP_OEMCP ?


Winansi.c converts directly to UTF-16, then uses WriteConsoleW to print. The OEM code page is irrelevant.



> If it is in winansi.c, and it can tell if input stream is Unicode or not,
> and provided "zero width" is requested by Unicode standard, i bet that
> filtering BOMs out when transcoding to non-Unicode charsets would be
>  the best way.
>
> K> IMO the best solution is not to use BOMs in commit messages.
>
> Well, i'd probably hack into library and suppress it.
> However, it is not disallowed, so is expected to be handled :-)
>


AFAIK, the Unicode spec says BOM in UTF-8 is optional. It doesn't say it MAY be added when writing, but MUST be handled when reading...nice try, though :-)
