gc and utf-8 BOM

gzeljko

Dec 16, 2010, 2:29:41 PM
to golang-nuts
Hi,
It would be nice if the compiler supported sources with a BOM.

Rob 'Commander' Pike

Dec 16, 2010, 2:38:47 PM
to gzeljko, golang-nuts

On Dec 16, 2010, at 11:29 AM, gzeljko wrote:

> Hi,
> It would be nice if the compiler supported sources with a BOM.

Go's source is in UTF-8. BOM is not valid UTF-8.

-rob

Gordon Tisher

Dec 16, 2010, 4:37:55 PM
to Rob 'Commander' Pike, golang-nuts

The UTF-8 BOM is EF BB BF. From the Unicode 5 Standard, section 16.8:

"In UTF-8, the BOM corresponds to the byte sequence <EF BB BF>. Although
there are never any questions of byte order with UTF-8 text, this
sequence can serve as signature for UTF-8 encoded text where the
character set is unmarked. As with a BOM in UTF-16, this sequence of
bytes will be extremely rare at the beginning of text files in other
character encodings."

From the Unicode FAQ (http://unicode.org/faq/utf_bom.html#bom5, my
emphasis):

"Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)?
If yes, then can I still assume the remaining UTF-8 bytes are in
big-endian order?

A: **Yes, UTF-8 can contain a BOM**. However, it makes no difference as
to the endianness of the byte stream. UTF-8 always has the same byte
order. An initial BOM is only used as a signature - an indication that
an otherwise unmarked text file is in UTF-8."
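
The byte sequence quoted above can be checked directly. The following is a minimal sketch using only Go's standard `unicode/utf8` package; the helper name `encodeBOM` is made up for illustration. It encodes the code point U+FEFF and prints the resulting bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeBOM (hypothetical helper) returns the UTF-8 encoding of U+FEFF,
// the code point that serves as the byte order mark.
func encodeBOM() []byte {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, '\uFEFF')
	return buf[:n]
}

func main() {
	// Print the encoded bytes as space-separated upper-case hex.
	fmt.Printf("% X\n", encodeBOM()) // prints "EF BB BF"
}
```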

--
Gordon Tisher
http://balafon.net

bflm

Dec 16, 2010, 5:23:19 PM
to golang-nuts
Any Unicode code point can be encoded in UTF-8, so a BOM is structurally
valid in UTF-8. A BOM is semantically invalid in UTF-8, as UTF-8 has no
byte order.
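
The structural half of that claim is easy to confirm. This is a small sketch, assuming nothing beyond the standard `unicode/utf8` package (the function name `checkBOM` is hypothetical): it asks whether the three bytes are well-formed UTF-8 and which code point they decode to. It says nothing about whether a BOM is useful, which is the semantic question.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// checkBOM (hypothetical helper) reports whether EF BB BF is well-formed
// UTF-8, the code point it decodes to, and how many bytes were consumed.
func checkBOM() (valid bool, r rune, size int) {
	b := []byte{0xEF, 0xBB, 0xBF}
	r, size = utf8.DecodeRune(b)
	return utf8.Valid(b), r, size
}

func main() {
	valid, r, size := checkBOM()
	fmt.Printf("valid=%v rune=%U size=%d\n", valid, r, size)
	// prints "valid=true rune=U+FEFF size=3"
}
```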

Russ Cox

Dec 16, 2010, 5:24:19 PM
to Gordon Tisher, Rob 'Commander' Pike, golang-nuts
> A: **Yes, UTF-8 can contain a BOM**. However, it makes no difference as
> to the endianness of the byte stream. UTF-8 always has the same byte
> order. An initial BOM is only used as a signature - an indication that
> an otherwise unmarked text file is in UTF-8."

Go source code is not an otherwise unmarked text file.
It is a file named *.go, and all Go files must be UTF-8.
There is no need for the BOM.

Frankly, it's a bizarre convention to litter otherwise
ordinary files with byte order marks when the encoding
used in the file has only one byte order.

Russ

Gordon Tisher

Dec 16, 2010, 5:34:38 PM
to r...@golang.org, Go Language List

Some editors are used to edit more than just .go files, and use the BOM
to both distinguish Unicode in general from other encodings, and between
UTF-8, UTF-16 and UTF-32. Given that they do so, it would be nice for
Go to just silently ignore the BOM, thus enabling people to edit
different kinds of files in their favorite editor without constantly
tweaking their settings.
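
For readers unfamiliar with how an editor might use the mark this way, here is a rough sketch of BOM sniffing. It is not taken from any particular editor; `detectBOM` and the encoding labels are names chosen for illustration. The byte patterns are the standard encodings of U+FEFF in each form.

```go
package main

import (
	"bytes"
	"fmt"
)

// detectBOM (hypothetical helper) returns the encoding implied by a leading
// BOM and the length of that BOM, or ("", 0) when no BOM is present.
func detectBOM(data []byte) (string, int) {
	// UTF-32 is tested before UTF-16 because the UTF-32LE mark begins
	// with the same two bytes (FF FE) as the UTF-16LE mark.
	marks := []struct {
		name string
		bom  []byte
	}{
		{"UTF-32BE", []byte{0x00, 0x00, 0xFE, 0xFF}},
		{"UTF-32LE", []byte{0xFF, 0xFE, 0x00, 0x00}},
		{"UTF-8", []byte{0xEF, 0xBB, 0xBF}},
		{"UTF-16BE", []byte{0xFE, 0xFF}},
		{"UTF-16LE", []byte{0xFF, 0xFE}},
	}
	for _, m := range marks {
		if bytes.HasPrefix(data, m.bom) {
			return m.name, len(m.bom)
		}
	}
	return "", 0
}

func main() {
	name, n := detectBOM([]byte("\xef\xbb\xbfpackage main\n"))
	fmt.Println(name, n) // prints "UTF-8 3"
}
```

Note that the check is a heuristic: UTF-16LE text whose first character after the mark is U+0000 would be reported as UTF-32LE.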

Rob 'Commander' Pike

Dec 16, 2010, 5:37:15 PM
to r...@golang.org, Gordon Tisher, golang-nuts
Another point: I'm afraid of opening the door to variant encodings by doing this. If UTF-8-encoded BOMs are legal, then why not encodings in which BOMs make sense? And if we let in BOMs, presumably we must make sure they're correct.

I think it's a really bad idea to let them in. The Windows compiler might have to, because Windows doesn't get UTF-8 right at all, but it would be a mistake to enable them everywhere.

-rob

Andrew Gerrand

Dec 16, 2010, 6:23:58 PM
to Gordon Tisher, r...@golang.org, Go Language List
On 17 December 2010 09:34, Gordon Tisher <gor...@balafon.net> wrote:
> Some editors are used to edit more than just .go files, and use the BOM
> to both distinguish Unicode in general from other encodings, and between
> UTF-8, UTF-16 and UTF-32.  Given that they do so, it would be nice for
> Go to just silently ignore the BOM, thus enabling people to edit
> different kinds of files in their favorite editor without constantly
> tweaking their settings.

Another approach is to put this before the compile stage of your build process:
http://www.ueber.net/who/mjl/projects/bomstrip/

Andrew

Andrew Gerrand

Dec 16, 2010, 7:29:58 PM
to Gordon Tisher, r...@golang.org, Go Language List

Here's a Go version of bomstrip you could use:

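package main

import (
	"bytes"
	"io"
	"log"
	"os"
)

// bom is the UTF-8 encoding of U+FEFF.
var bom = []byte("\xef\xbb\xbf")

func main() {
	// Read up to three bytes; input shorter than that is not an error.
	b := make([]byte, len(bom))
	n, err := io.ReadFull(os.Stdin, b)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		log.Fatal(err)
	}
	// Pass the leading bytes through unless they are exactly the BOM.
	if !bytes.Equal(b[:n], bom) {
		if _, err := os.Stdout.Write(b[:n]); err != nil {
			log.Fatal(err)
		}
	}
	// Copy the rest of the input unchanged.
	if _, err := io.Copy(os.Stdout, os.Stdin); err != nil {
		log.Fatal(err)
	}
}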
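
The same idea can be applied inside a program rather than as a separate build step. What follows is only a sketch, assuming a current Go standard library (`io.ReadAll` needs Go 1.16 or later); `skipBOM` and `stripString` are hypothetical names, not part of any Go package. It wraps an `io.Reader` so that a leading UTF-8 BOM is discarded before anything else reads from it.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"log"
	"strings"
)

// utf8BOM is the UTF-8 encoding of U+FEFF.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// skipBOM (hypothetical helper) wraps r and discards a leading UTF-8 BOM
// if one is present. Input shorter than three bytes is passed through.
func skipBOM(r io.Reader) (io.Reader, error) {
	br := bufio.NewReader(r)
	// Peek looks at the first bytes without consuming them. On short
	// input it returns what is available together with io.EOF.
	head, err := br.Peek(len(utf8BOM))
	if err != nil && err != io.EOF {
		return nil, err
	}
	if bytes.Equal(head, utf8BOM) {
		if _, err := br.Discard(len(utf8BOM)); err != nil {
			return nil, err
		}
	}
	return br, nil
}

// stripString (hypothetical helper) runs skipBOM over a string and
// returns whatever remains.
func stripString(s string) (string, error) {
	r, err := skipBOM(strings.NewReader(s))
	if err != nil {
		return "", err
	}
	out, err := io.ReadAll(r)
	return string(out), err
}

func main() {
	out, err := stripString("\xef\xbb\xbfpackage main\n")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%q\n", out) // prints "package main\n" as a quoted string
}
```

Only a BOM at the very start is removed; a U+FEFF later in the stream is left alone.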

Attila Tajti

Dec 17, 2010, 2:44:15 AM
to Rob 'Commander' Pike, r...@golang.org, Gordon Tisher, golang-nuts

On 16 Dec 2010, at 23:37, Rob 'Commander' Pike wrote:

> I think it's a really bad idea to let them in. The Windows compiler might have to, because Windows doesn't get UTF-8 right at all, but it would be a mistake to enable them everywhere.

There seem to be two distinct issues with UTF-8 on Windows:

1. Crappy tools writing/requiring BOMs in UTF-8 text files

2. The Windows API is problematic with UTF-8 (especially on Windows XP); one ought to use UTF-16 wherever possible

I do not think they have anything to do with each other. And for the record, spending most of my development time on Windows, I have no problem with issue 1. Good tools, even on Windows, can (and should) identify 8-bit encodings without the need for a BOM.

-- Attila

gzeljko

Dec 17, 2010, 5:24:41 AM
to golang-nuts


On 16 Dec, 23:24, Russ Cox <r...@golang.org> wrote:
>
> Go source code is not an otherwise unmarked text file.
> It is a file named *.go, and all Go files must be UTF-8.
> There is no need for the BOM.
>

OK, I reconsidered.
It was just the first time I saw something that supports UTF-8 but not
a BOM. I solved the unexpected problem with my favorite editor :)
No big deal; maybe it is better not to clutter things.

Uriel

Dec 17, 2010, 7:23:27 AM
to Gordon Tisher, r...@golang.org, Go Language List
On Thu, Dec 16, 2010 at 11:34 PM, Gordon Tisher <gor...@balafon.net> wrote:
> Some editors are used to edit more than just .go files, and use the BOM
> to both distinguish Unicode in general from other encodings, and between
> UTF-8, UTF-16 and UTF-32.  Given that they do so, it would be nice for
> Go to just silently ignore the BOM, thus enabling people to edit
> different kinds of files in their favorite editor without constantly
> tweaking their settings.

Any text editor that adds a BOM to UTF-8 files is broken; if that is a
problem, report it to the author of that text editor.

That the Unicode folks took a perfectly sane UTF-8 standard and
decided to allow an abomination like the BOM shows that one can trust
standards bodies to always, always fuck everything up.

uriel

chris dollin

Dec 17, 2010, 7:34:51 AM
to Uriel, Gordon Tisher, r...@golang.org, Go Language List
On 17 December 2010 12:23, Uriel <ur...@berlinblue.org> wrote:
> On Thu, Dec 16, 2010 at 11:34 PM, Gordon Tisher <gor...@balafon.net> wrote:
>> Some editors are used to edit more than just .go files, and use the BOM
>> to both distinguish Unicode in general from other encodings, and between
>> UTF-8, UTF-16 and UTF-32.  Given that they do so, it would be nice for
>> Go to just silently ignore the BOM, thus enabling people to edit
>> different kinds of files in their favorite editor without constantly
>> tweaking their settings.
>
> Any text editor that adds a BOM to UTF-8 files is broken, if that is a
> problem, report it to the author of such a text editor.

By a strange coincidence I have in my other hand a report about
the Jena Turtle parser's non-support for a BOM, Turtle having
mandatory UTF-8 encoding. That user says:

> The data files are coming from my software which is all written
> in .Net and when outputting in UTF-8 the default behaviour of .Net
> is to include the BOM at the start of the file.

So it may not be as easy as reporting it to the author of "the" text
editor ...

Chris

--
Chris "No BOM today. BOM tomorrow?" Ivanova Dollin

Uriel

Dec 17, 2010, 7:52:37 AM
to chris dollin, Gordon Tisher, r...@golang.org, Go Language List
On Fri, Dec 17, 2010 at 1:34 PM, chris dollin <ehog....@googlemail.com> wrote:
> On 17 December 2010 12:23, Uriel <ur...@berlinblue.org> wrote:
>> On Thu, Dec 16, 2010 at 11:34 PM, Gordon Tisher <gor...@balafon.net> wrote:
>>> Some editors are used to edit more than just .go files, and use the BOM
>>> to both distinguish Unicode in general from other encodings, and between
>>> UTF-8, UTF-16 and UTF-32.  Given that they do so, it would be nice for
>>> Go to just silently ignore the BOM, thus enabling people to edit
>>> different kinds of files in their favorite editor without constantly
>>> tweaking their settings.
>>
>> Any text editor that adds a BOM to UTF-8 files is broken, if that is a
>> problem, report it to the author of such a text editor.
>
> By a strange coincidence I have in my other hand a report about
> the Jena Turtle parser's non-support for a BOM, Turtle having
> mandatory UTF-8 encoding. That user says:
>
>> The data files are coming from my software which is all written
>> in .Net and when outputting in UTF-8 the default behaviour of .Net
>> is to include the BOM at the start of the file.

I'm speechless...

> So it may not be as easy as reporting it to the author of "the" text
> editor ...

Seems like the only decent solution will be to have somebody drop a
'physics package' on Redmond.

uriel

P.S.: I miss boyd :(

chris dollin

Dec 17, 2010, 7:58:41 AM
to Uriel, Gordon Tisher, r...@golang.org, Go Language List
On 17 December 2010 12:52, Uriel <ur...@berlinblue.org> wrote:
> On Fri, Dec 17, 2010 at 1:34 PM, chris dollin <ehog....@googlemail.com> wrote:

>>> The data files are coming from my software which is all written
>>> in .Net and when outputting in UTF-8 the default behaviour of .Net
>>> is to include the BOM at the start of the file.
>
> I'm speechless...

Well, just because that user says it doesn't mean it's true.
Maybe it's just their local configuration. Maybe they've
misinterpreted something. Maybe they mailed in from the
Weirdzo universe.

Chris

--
Chris "allusive" Dollin

Attila Tajti

Dec 17, 2010, 8:09:47 AM
to chris dollin, Uriel, Gordon Tisher, r...@golang.org, Go Language List

Or they are processing a source UTF-8 file that already has a BOM in it.
According to [1] .Net tries to hide BOMs as much as it can: the string with a BOM
is equal to the same string without the BOM, and the BOM does not appear
in the debugger or normal console output...

[1] http://chriscant.phdcc.com/2010/02/systemstring-hidden-utf8-bom.html
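
For contrast with the .Net behaviour described in [1], a Go string is just a sequence of bytes, so a leading BOM is not hidden from comparisons or lengths. The sketch below illustrates that; `sameIgnoringBOM` is a hypothetical helper written for this example, not a library function.

```go
package main

import (
	"fmt"
	"strings"
)

// bom is U+FEFF; in a Go string literal it is stored as the bytes EF BB BF.
const bom = "\uFEFF"

// sameIgnoringBOM (hypothetical helper) compares two strings after removing
// a single leading BOM from each.
func sameIgnoringBOM(a, b string) bool {
	return strings.TrimPrefix(a, bom) == strings.TrimPrefix(b, bom)
}

func main() {
	with, without := bom+"hello", "hello"
	fmt.Println(with == without)                // prints "false": the bytes differ
	fmt.Println(len(with), len(without))        // prints "8 5": the BOM adds three bytes
	fmt.Println(sameIgnoringBOM(with, without)) // prints "true"
}
```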

gzeljko

Dec 17, 2010, 9:55:55 AM
to golang-nuts


On 17 Dec, 13:23, Uriel <ur...@berlinblue.org> wrote:
>
> Any text editor that adds a BOM to UTF-8 files is broken, if that is a
> problem, report it to the author of such a text editor.
>

It was not about the editor voluntarily adding a BOM, but about how
the editor is supposed to recognise UTF-8.
I agree that *.go = UTF-8 is a reasonable approach. A good editor
should support that somehow.
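
One way an editor could recognise UTF-8 without a BOM is sketched below. This is an assumption about how such a heuristic might look, not a description of any real editor; `guessUTF8` is a made-up name. It treats the `.go` extension as decisive and otherwise falls back to checking that the content is well-formed UTF-8.

```go
package main

import (
	"fmt"
	"path/filepath"
	"unicode/utf8"
)

// guessUTF8 (hypothetical editor heuristic) reports whether a file should
// be opened as UTF-8, using the file name first and the content second.
func guessUTF8(filename string, content []byte) bool {
	if filepath.Ext(filename) == ".go" {
		return true // Go source files are UTF-8 by definition
	}
	// Bytes that are not well-formed UTF-8 suggest a legacy 8-bit encoding.
	return utf8.Valid(content)
}

func main() {
	fmt.Println(guessUTF8("main.go", nil))                      // prints "true"
	fmt.Println(guessUTF8("notes.txt", []byte("na\xc3\xafve"))) // prints "true": valid UTF-8
	fmt.Println(guessUTF8("notes.txt", []byte("na\xefve")))     // prints "false": lone Latin-1 byte
}
```

Pure ASCII content also passes the validity check, which is harmless because ASCII is a subset of UTF-8.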