msy...@googlegroups.com wrote on 15.08.2012 11:44:34:
> E>> Interesting. Do you have a script that reproduces it? I cannot find
> E>> the string "unknown symbol" in our sources, though. Are you sure that
> E>> warning is output by Git?
> E>>
> E>> Karsten, do you think this is a bug?
>
> K> Yes, I think it is a bug in the Unicode specification to allow BOM
> K> in UTF-8 :-)
>
> K> Byte order in UTF-8 is always the same on all platforms, a Byte Order
> K> Mark (BOM) has no purpose here.
>
> It serves the purpose of clearly marking a UTF-8 stream.
No, it doesn't. In many encodings, 0xef, 0xbb and 0xbf are just normal letters, so files starting with these bytes can actually be in any encoding. It is generally impossible to reliably detect the encoding of plain text files from their (encoded) content alone. The best you can do is make a good guess, and you don't need the BOM for that.
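To illustrate the ambiguity, here is a minimal Python sketch. The sample text and the choice of Windows-1252 and ISO 8859-1 as legacy encodings are assumptions made for the example; any single-byte encoding that assigns letters to these byte values behaves the same way.

```python
# Minimal sketch: the same three leading bytes are valid text in several encodings.
BOM_UTF8 = b"\xef\xbb\xbf"
data = BOM_UTF8 + b"hello"

# Read as UTF-8, the three bytes decode to the single character U+FEFF.
as_utf8 = data.decode("utf-8")

# Read as Windows-1252 or ISO 8859-1, they decode to three ordinary characters.
as_cp1252 = data.decode("cp1252")
as_latin1 = data.decode("latin-1")

for name, text in [("utf-8", as_utf8), ("cp1252", as_cp1252), ("latin-1", as_latin1)]:
    # ascii() escapes non-ASCII characters so the output is safe on any console.
    print(name, len(text), ascii(text))
```

All three decodes succeed without error, so the bytes alone do not say which reading was intended.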
POSIX systems typically don't support BOM, as it breaks ASCII compatibility and conflicts with the convention that shell scripts start with "#!".
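Both points can be shown with a short Python sketch. The helper `starts_with_shebang` is hypothetical and only models a check of the first two bytes of a file; it is not taken from any real kernel or shell.

```python
def starts_with_shebang(data: bytes) -> bool:
    """Hypothetical helper: report whether the file content begins with '#!'."""
    return data[:2] == b"#!"

script = b"#!/bin/sh\necho hello\n"
script_with_bom = b"\xef\xbb\xbf" + script

# The plain script is pure ASCII and begins with "#!".
print(script.isascii(), starts_with_shebang(script))

# With a BOM prepended, the file is no longer ASCII and no longer begins with "#!".
print(script_with_bom.isascii(), starts_with_shebang(script_with_bom))
```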
Windows still doesn't support UTF-8 as the system encoding, so the BOM in text files is needed here to distinguish between UTF-8 and the legacy system encoding. IMO this is a Windows-specific workaround for a Windows-specific problem.
> While other heuristics do exist, BOM presence is the most accurate.
> Some libraries even call that a preamble, rather than a BOM.
>
> K> Git tries hard not to modify content, including the commit message.
> K> This is well documented in the git-commit man page ("The commit log
> K> messages are uninterpreted sequences of non-NUL bytes."). If you
> K> tell git to store a BOM character in the commit message, git will do that.
>
> Makes sense. In storage it is to be kept "as is".
> That, however, does not mean the BOM is to be preserved when reading from
> storage, when merging a few text streams into one.
>
> K> AFAIK, the UTF-8 BOM is designed
> K> to be composed of purely non-visible characters, so should
> K> we really complain (if it turns out we do, that is)?
>
> I wonder how UTF-16 would work there :-)
>
'git-commit -F <UTF-16 file>' fails with an error because UTF-16 tends to be full of NUL bytes.
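A quick Python sketch shows where the NUL bytes come from. The sample message is made up for the example, and nothing here reproduces Git's actual check or error text.

```python
message = "Fix typo"  # sample commit message, ASCII-only

utf8_bytes = message.encode("utf-8")
utf16_bytes = message.encode("utf-16")  # BOM, then two bytes per character here

print(b"\x00" in utf8_bytes)   # False
print(b"\x00" in utf16_bytes)  # True

# Each ASCII character is stored as its byte value paired with a zero byte.
print(len(utf8_bytes), len(utf16_bytes), utf16_bytes.count(b"\x00"))
```

For mostly-ASCII text, roughly every second byte of the UTF-16 form is NUL, which cannot be stored in a sequence of non-NUL bytes.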
> The trouble is that the Windows console is not Unicode-aware. So to output
> something on it, the text should be transcoded in advance.
> $FFFE has no correspondence in non-Unicode codepages. So it is probably
> replaced with a kind of joker.
>
> K> According to the Unicode spec, BOM in the middle of text should be
> K> treated as zero width non-breaking space. We could filter BOM
> K> characters in our winansi.c console driver, to fix the console
> K> --no-pager case. I don't know how less, gitk, git-gui etc. handle BOMs.
>
> As I wrote above, Gitk works fine. git-gui just has no means to show
> the log, it calls gitk for that.
> Less becomes a problem though.
>
> Where is the part that transcodes the message to CP_OEMCP?
winansi.c converts directly to UTF-16, then uses WriteConsoleW to print. The OEM code page is irrelevant.
> If it is in winansi.c, and it can tell if input stream is Unicode or not,
> and provided "zero width" is requested by Unicode standard, I bet that
> filtering BOMs out when transcoding to non-Unicode charsets would be
> the best way.
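The filtering idea can be sketched as follows. This is Python for illustration only, not the actual winansi.c code: `strip_bom` is a hypothetical helper, and cp437 merely stands in for some non-Unicode code page.

```python
BOM = "\ufeff"  # U+FEFF, also "zero width no-break space" inside text

def strip_bom(text: str) -> str:
    """Hypothetical helper: drop every U+FEFF before transcoding."""
    return text.replace(BOM, "")

message = BOM + "Fix typo"

# Without filtering, the unmappable character becomes a substitute ("?").
unfiltered = message.encode("cp437", errors="replace")

# With filtering, only the visible text is transcoded.
filtered = strip_bom(message).encode("cp437", errors="replace")

print(unfiltered)  # b'?Fix typo'
print(filtered)    # b'Fix typo'
```

Since U+FEFF has zero width by definition, dropping it loses nothing visible.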
>
> K> IMO the best solution is not to use BOMs in commit messages.
>
> Well, I'd probably hack into the library and suppress it.
> However, it is not disallowed, so is expected to be handled :-)
>
AFAIK, the Unicode spec says BOM in UTF-8 is optional. It doesn't say it MAY be added when writing, but MUST be handled when reading... nice try, though :-)