UTF8 BOM Character Confuses JSON Decoder

Rick

Jul 10, 2013, 8:23:53 AM
to golan...@googlegroups.com
We have written a Go client to access weather forecasts using a JSON API. For example


It turns out that the service puts a Byte Order Mark (BOM) at the start of its UTF-8 response, and that confuses the Go JSON decoding package. According to the docs (http://www.unicode.org/faq/attribution.html#AF), a BOM is unnecessary in UTF-8 and can confuse some consumers (as I discovered), but it is permitted. The work-around is simple: check the body for the BOM and remove it before processing. It still took a while to discover the problem, because browsers ignore the BOM.
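For anyone landing here with the same problem, the work-around can be sketched roughly as follows. This is only a minimal sketch: the Forecast type, its field names, and decodeForecast are hypothetical stand-ins for illustration, not the actual client or the real API's response shape.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// utf8BOM is the UTF-8 encoding of U+FEFF.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// Forecast is a hypothetical response shape, used only for illustration.
type Forecast struct {
	City  string  `json:"city"`
	TempC float64 `json:"temp_c"`
}

// decodeForecast removes a leading BOM, if there is one, and then
// hands the remaining bytes to encoding/json.
func decodeForecast(body []byte) (Forecast, error) {
	var f Forecast
	body = bytes.TrimPrefix(body, utf8BOM)
	err := json.Unmarshal(body, &f)
	return f, err
}

func main() {
	// "\ufeff" in a Go string literal is encoded as the three BOM bytes.
	body := []byte("\ufeff" + `{"city":"Oslo","temp_c":17.5}`)
	f, err := decodeForecast(body)
	fmt.Println(f, err)
}
```

bytes.TrimPrefix leaves the slice untouched when the prefix is absent, so the same function should work for services that do not send a BOM.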

I'm wondering whether it would be reasonable for the Go JSON package itself to check for the BOM. Alternatively, perhaps some mention of potential issues with services that include a BOM could be included in the Go JSON docs.

Andy Balholm

Jul 10, 2013, 11:38:31 AM
to golan...@googlegroups.com
The Go team includes the original designers of UTF-8, and they consider BOMs an aBOMination. They are reluctant to do anything to make life easier for people who use BOMs. :-)

(Although they did make the compiler accept source files with BOMs, if I remember right.)

Arne Hormann

Jul 10, 2013, 12:12:04 PM
to golan...@googlegroups.com
In JSON, BOM would have to be a whitespace character to be valid before the first token. See the last paragraph before the section with implementation links on json.org.
Following the Unicode spec, a BOM is not a whitespace character (search for "white_space").
That's why the team is right and most of the world is wrong...

Arne Hormann

Jul 10, 2013, 12:22:57 PM
to golan...@googlegroups.com
But maybe I'm wrong here. JSON references the ECMA-spec which states in 7.1 on page 14 that <BOM> should be interpreted as a whitespace... I'll leave closer reading to somebody else as I already messed up :)

André Moraes

Jul 10, 2013, 12:46:11 PM
to Arne Hormann, golan...@googlegroups.com
On Wed, Jul 10, 2013 at 1:22 PM, Arne Hormann <arneh...@gmail.com> wrote:
> But maybe I'm wrong here. JSON references the ECMA-spec which states in 7.1
> on page 14 that <BOM> should be interpreted as a whitespace... I'll leave
> closer reading to somebody else as I already messed up :)

JSON != EcmaScript

The JSON spec says it was derived from EcmaScript but that doesn't
mean JSON should follow rules from EcmaScript.

Your first response was right.

--
André Moraes
http://amoraes.info

Arne Hormann

Jul 10, 2013, 1:03:56 PM
to golan...@googlegroups.com, Arne Hormann
Right, page 202, rule "JSONWhiteSpace": Tab, Carriage return, Line feed and Space only...
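A quick way to see the whitespace rule in practice is to ask encoding/json whether it accepts a document with various prefixes. This is just a small sketch; validWithPrefix is a helper name made up for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// validWithPrefix reports whether encoding/json accepts a small JSON
// document when it is preceded by the given prefix.
func validWithPrefix(prefix string) bool {
	return json.Valid([]byte(prefix + `{"ok":true}`))
}

func main() {
	// Space, tab, carriage return and line feed are JSON whitespace.
	fmt.Println(validWithPrefix(" \t\r\n")) // true
	// A UTF-8 encoded BOM (U+FEFF) is not, so the document is rejected.
	fmt.Println(validWithPrefix("\ufeff")) // false
}
```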

Rodrigo Kochenburger

Jul 10, 2013, 1:37:07 PM
to golan...@googlegroups.com, Arne Hormann
What is the reason for a BOM in UTF-8 anyway, since it fits in one byte?

Uli Kunitz

Jul 10, 2013, 2:35:37 PM
to golan...@googlegroups.com
It shouldn't be too difficult to write an io.Reader implementation that eats the BOM.
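Something along these lines might do, using bufio to peek at the first three bytes. It is a rough sketch rather than a polished package; skipBOM and decodeCity are names invented for the example, and the "city" field is hypothetical.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// skipBOM wraps r and drops a leading UTF-8 BOM (EF BB BF) if present.
func skipBOM(r io.Reader) io.Reader {
	br := bufio.NewReader(r)
	// Peek does not consume input. If fewer than three bytes are
	// available it returns an error, and there is no BOM to drop.
	b, err := br.Peek(3)
	if err == nil && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF {
		br.Discard(3)
	}
	return br
}

// decodeCity decodes a hypothetical {"city": ...} document through skipBOM.
func decodeCity(input string) (string, error) {
	var v struct {
		City string `json:"city"`
	}
	err := json.NewDecoder(skipBOM(strings.NewReader(input))).Decode(&v)
	return v.City, err
}

func main() {
	city, err := decodeCity("\ufeff" + `{"city":"Oslo"}`)
	fmt.Println(city, err)
}
```

Wrapping the reader keeps the rest of the pipeline streaming, which matters when the body is handed straight to json.NewDecoder instead of being read into memory first.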

peterGo

Jul 10, 2013, 3:01:25 PM
to golan...@googlegroups.com
Rick,

You posted a bad link. I think this is what you were looking for.

Unicode
Byte Order Mark (BOM) FAQ
http://www.unicode.org/faq/utf_bom.html#BOM

Q: What is a BOM?

A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.

Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?

A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature: an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix shell scripts.

Peter

John Jeffery

Nov 17, 2015, 12:42:52 AM
to golang-nuts
I realise this is an old post, but I have had to deal with this issue myself lately. Uli mentioned that it should not be difficult to write a reader that eats BOMs. Because I have had to do this a few times now I've put together a tiny package that does just this. Available at github.com/spkg/bom.