UTF8 + BOM


Noah Slater

Oct 21, 2009, 9:52:31 PM
to asciidoc
Hey,

If you use a BOM at the start of a UTF-8 document, AsciiDoc puts the
BOM character into the <title> element of the DocBook output. This is
incorrect.

Thanks,

Noah

Stuart Rackham

Oct 21, 2009, 11:36:58 PM
to asci...@googlegroups.com
Hi Noah

Could you post an example?

While we're on the subject, a UTF-8 BOM is allowed by the Unicode standard, but
it is not recommended ("Use of a BOM is neither required nor recommended for
UTF-8, but may be encountered in contexts where UTF-8 data is converted from
other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature").


Cheers, Stuart



Yves-Alexis Perez

Oct 22, 2009, 2:08:45 AM
to asci...@googlegroups.com
On Thu, 2009-10-22 at 16:36 +1300, Stuart Rackham wrote:
> Could you post an example.

In fact I had the problem too. Some crappy editor (like notepad++) may
add that BOM as the first character.

If one uses the “short” title (= My document) asciidoc doesn't correctly
parse it because the first character in the line is not the =.

It'd be nice if asciidoc could skip it (it may save people time, because
it can be hard to work out what is happening), but it's obviously a
problem in the editor, and corner-casing everything might be a little
hard eventually.
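
To illustrate the failure mode, here is a minimal Python sketch. The `looks_like_title` check is a hypothetical stand-in, not AsciiDoc's actual title parser; it only assumes that a one-line title must begin with "= ".

```python
import codecs

# A document as an editor might save it, with a leading UTF-8 signature.
raw = codecs.BOM_UTF8 + "= My document\n".encode("utf-8")
first_line = raw.decode("utf-8").splitlines()[0]


def looks_like_title(line):
    # Hypothetical stand-in for a one-line title check.
    return line.startswith("= ")


# The decoded BOM (U+FEFF) sits in front of the "=", so the check fails.
print(looks_like_title(first_line))
# Once the BOM is removed, the same line matches.
print(looks_like_title(first_line.lstrip("\ufeff")))
```

The BOM is invisible in most editors, which is why the title silently stops being recognised.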

Cheers,
--
Yves-Alexis

Stuart Rackham

Oct 22, 2009, 2:48:23 AM
to asci...@googlegroups.com

I'm not clear what the problem is. Are you saying that it's caused by an editor
automatically inserting a BOM character at the start of a document?
How do you reproduce the problem?

Cheers, Stuart



Yves-Alexis Perez

Oct 22, 2009, 7:38:28 AM
to asci...@googlegroups.com
Stuart Rackham wrote:

Yes, some editors insert a BOM in front of UTF-8 documents (Notepad++
does, with the default configuration, iirc). I don't have a Windows
machine so I can't reproduce it, but I'm sure someone else could provide
a legit document (or I could just manually insert a BOM in a text
document using a hex editor).

--
Yves-Alexis

Stuart Rackham

Oct 22, 2009, 3:53:06 PM
to asci...@googlegroups.com

I saved 3 files in MS Notepad in UTF-8; the start of each file contained the
UTF-8 BOM (ef bb bf).

If the solution is simply to strip the leading utf8 BOM from the file then this
would be straight-forward. Are there any implications to doing this?
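
As a rough illustration of what "strip the leading BOM" could look like in Python, here is a minimal sketch. The function name `strip_utf8_bom` is hypothetical and this is not AsciiDoc's code; `codecs.BOM_UTF8` is the standard-library constant for the bytes ef bb bf.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (ef bb bf), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Only a BOM at the very start is removed; other content is left alone.
print(strip_utf8_bom(b"\xef\xbb\xbf= My document\n"))
print(strip_utf8_bom(b"= My document\n"))
```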


Cheers, Stuart

Yves-Alexis Perez

Oct 22, 2009, 4:38:56 PM
to asci...@googlegroups.com
On Fri, 2009-10-23 at 08:53 +1300, Stuart Rackham wrote:
> If the solution is simply to strip the leading utf8 BOM from the file
> then this
> would be straight-forward. Are there any implications to doing this?

I don't think so. Even in the original .txt file the BOM shouldn't be
needed (afaik UTF-8 doesn't need it at all, since UTF-8 is
endian-independent). The generated file shouldn't be affected at all,
and neither should the running script, since it'll accept valid UTF-8
correctly.

I don't know exactly what you should do when you encounter a BOM in a
UTF-8 file, but “ignore it” seems valid to me.

Cheers,

--
Yves-Alexis

ak_avenger

Oct 23, 2009, 10:35:14 PM
to asciidoc
BOMs serve no purpose in UTF-8. There IS NO Byte Order to Mark, unlike
some other encodings.

The only reasonable thing to do upon encountering one is to ignore it
(or delete it).

Many parsers choke on BOMs (pre-1.9 Ruby, some proprietary languages
I've used), and most text editors won't even display them (Windows
Notepad and SciTE, even though SciTE inserts them!).

In fact, the only "text editor" I know of which displays BOMs, and
therefore allows you to delete them, is NetBeans. So anytime I need to
strip a BOM, it takes me a good five minutes to get started.

So no, Stuart, there are no implications to ignoring it. I'm always
happy to learn something new by being proven wrong, but I doubt it
will happen in this case.

Stuart Rackham

Oct 23, 2009, 10:47:26 PM
to asci...@googlegroups.com
Thanks for the clarification, unless anyone has reasonable objections I'll just
ignore the UTF-8 BOM in input files.

Cheers, Stuart



Lex Trotman

Oct 23, 2009, 11:22:46 PM
to asci...@googlegroups.com
Hi,

Just adding my $0.02 worth.

2009/10/24 Stuart Rackham <srac...@gmail.com>:
>
> ak_avenger wrote:
>> BOMs serve no purpose in UTF-8. There IS NO Byte Order to Mark, unlike
>> some other encodings.

Not totally true, although the BOM has no effect on the UTF-8 itself,
to quote Unicode.org "it can be used as a hint that an octet sequence
is in UTF-8 and not some other 8 bit encoding" (it can only be a hint
since some other 8 bit encoding could possibly have the same first
three characters)

>>
>> The only reasonable thing to do upon encountering one is to ignore it

Other than choosing to treat the stream as UTF-8, that's true.

>> (or delete it).

Depends on your environment, in mixed encoding environments it can be
useful so it should not be deleted without thought.

>>
>> Many parsers choke on BOMs (pre-1.9 Ruby, some proprietary languages
>> I've used), and most text editors won't even display them (Windows
>> Notepad and SciTE, even though SciTE inserts them!).
>>
>> In fact, the only "text editor" I know of which displays BOMs, and
>> therefore allows you to delete them, is NetBeans. So anytime I need to
>> strip a BOM, it takes me a good five minutes to get started.
>>
>> So no, Stuart, there are no implications to ignoring it. I'm always
>> happy to learn something new by being proven wrong, but I doubt it
>> will happen in this case.
>
> Thanks for the clarification, unless anyone has reasonable objections I'll just
> ignore the UTF-8 BOM in input files.

It may be useful to give a warning if a file with a BOM also has an
encoding attribute which sets another encoding, as it is sort of
contradictory. I assume that prior to encountering the encoding
attribute Asciidoc treats the file as UTF-8 and doesn't re-read it
from the start. Encodings set on the command line should of course
treat the "BOM" as bytes in the specified encoding.
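
A minimal Python sketch of such a consistency check follows. The function `bom_conflict_warning` and its message text are hypothetical, not part of AsciiDoc; the sketch only assumes the declared encoding is available as a string and uses `codecs.lookup` to normalise spellings such as "UTF8" and "utf-8".

```python
import codecs


def bom_conflict_warning(data, declared_encoding):
    """Return a warning string if data starts with a UTF-8 BOM but the
    declared encoding is not UTF-8; return None if there is no conflict."""
    if not data.startswith(codecs.BOM_UTF8):
        return None
    try:
        # Normalise aliases ("UTF8", "utf_8", ...) to one canonical name.
        name = codecs.lookup(declared_encoding).name
    except LookupError:
        return "unknown encoding %r" % declared_encoding
    if name in ("utf-8", "utf-8-sig"):
        return None
    return ("file starts with a UTF-8 BOM but declares encoding %r"
            % declared_encoding)


print(bom_conflict_warning(b"\xef\xbb\xbf= Title\n", "ISO-8859-15"))
print(bom_conflict_warning(b"\xef\xbb\xbf= Title\n", "UTF-8"))
```

Only a warning is produced here; whether to trust the BOM or the attribute is a separate policy decision.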

Cheers
Lex

PS I'll be happy when mixed encodings pass away but it may be a while yet


ak_avenger

Oct 24, 2009, 1:53:14 AM
to asciidoc
> It may be useful to give a warning if a file with a BOM also has an
> encoding attribute which sets another encoding, as it is sort of
> contradictory.

True, might as well have a warning if things don't make sense,
especially since it won't affect the contents of the generated file.


> >> The only reasonable thing to do upon encountering one is to ignore it
>
> Other than choosing to treat the stream as UTF-8, that's true.

Well, like you quoted from Unicode.org, it still doesn't clearly
identify it as UTF-8, it's a hint. And Asciidoc already treats files
as UTF-8 by default, without a BOM (right?).


> >> (or delete it).
>
> Depends on your environment, in mixed encoding environments it can be
> useful so it should not be deleted without thought.

I'm not actually suggesting that Asciidoc should modify the source
files or anything, sorry about the lack of clarity there. Humans and
text editors should in most (all?) cases, though.

If you need a way to remember the encoding of a file, an invisible and
ambiguous character seems like a terrible choice. There are visible
and explicit ways to do this in, as far as I know, any file that
requires it.

XML has encoding declarations, and programming languages which support
mixed encodings have special comments that parsers, humans, and text
editors understand. I've never had a program crash or produce
incorrect output because of one of these comments. BOMs on the other
hand...

Lex Trotman

Oct 24, 2009, 2:10:38 AM
to asci...@googlegroups.com
Hi Stuart,

I need to add a little to what I said below.

It has been suggested that the encoding should be chosen by (in
increasing priority):

1. locale, after all this is most probably what your editor is going
to read/write by default

2. UTF if a BOM is found (8/16/32 & endianness for 16 & 32 are all
handled by Python). It is only a hint, but you would be really unlucky
to find a file in another encoding starting with a BOM. The
characters chosen for the BOM are not usually text characters in any
standard encoding.

3. Explicitly specified encoding either on command line or in the file
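
The three-level priority above can be sketched in Python roughly as follows. The function `choose_encoding` and its `explicit` parameter are hypothetical names used for illustration; the BOM constants and `locale.getpreferredencoding` come from the standard library.

```python
import codecs
import locale

# UTF-32 must be tested before UTF-16: the UTF-32 LE BOM (ff fe 00 00)
# begins with the UTF-16 LE BOM (ff fe).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def choose_encoding(data, explicit=None):
    """Pick an encoding name for raw bytes, highest priority first."""
    # 3. An explicitly specified encoding always wins.
    if explicit:
        return explicit
    # 2. Otherwise a leading BOM is taken as a hint for a UTF encoding.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 1. Fall back to the locale's preferred encoding.
    return locale.getpreferredencoding(False)


data = b"\xef\xbb\xbf= My document\n"
print(data.decode(choose_encoding(data)))
```

The "utf-8-sig", "utf-16" and "utf-32" codecs consume the BOM while decoding, so the decoded text does not start with U+FEFF.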

A European friend pointed out that on his systems where there is a
large body of legacy text in some encoding, then the locale is usually
set to that, and so the editor reads and writes it unless it finds a
BOM.

Why write new files in the old encoding? Because they have to be
processed by old tools along with old files. But they can be written
in UTF-8 when possible.

This behaviour allows a slow transition from the legacy encoding
(usually cp125x or ISO8859-x series) to UTF-8 to happen. Bulk
converting is rarely viable, or possible in the presence of old tools.

Cheers
Lex


Lex Trotman

Oct 24, 2009, 2:27:59 AM
to asci...@googlegroups.com
Hi, this came in while I was writing my previous :-)

2009/10/24 ak_avenger <ak_av...@hotmail.com>:
>
>> It may be useful to give a warning if a file with a BOM also has an
>> encoding attribute which sets another encoding, as it is sort of
>> contradictory.
>
> True, might as well have a warning if things don't make sense,
> especially since it won't affect the contents of the generated file.
>
>
>> >> The only reasonable thing to do upon encountering one is to ignore it
>>
>> Other than choosing to treat the stream as UTF-8, that's true.
>
> Well, like you quoted from Unicode.org, it still doesn't clearly
> identify it as UTF-8, it's a hint. And Asciidoc already treats files
> as UTF-8 by default, without a BOM (right?).

Yeah, but remember that Unicode.org is a standards organisation and
has to be exactly correct. In real life the BOM octets have been
chosen to not be text characters in most encodings, theoretically they
can occur, but in an actual *text* file they won't.
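
As a small illustration of this point, the following Python sketch decodes the three UTF-8 BOM octets in a few legacy 8-bit encodings. The choice of encodings is just an example set; the point is that the octets are decodable there, but form a sequence that real text is unlikely to start with.

```python
bom = b"\xef\xbb\xbf"

# In these legacy 8-bit encodings the BOM octets decode to "ï»¿",
# which is valid but a very unlikely way for a text file to begin.
for name in ("iso8859-1", "iso8859-15", "cp1252"):
    print(name, bom.decode(name))
```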

>
>
>> >> (or delete it).
>>
>> Depends on your environment, in mixed encoding environments it can be
>> useful so it should not be deleted without thought.
>
> I'm not actually suggesting that Asciidoc should modify the source
> files or anything, sorry about the lack of clarity there. Humans and
> text editors should in most (all?) cases, though.

Still depends on their environment, if your locale is set to
ISO8859-15 (Western Europe) then you want to mark UTF-8 files since
they are not in the "default" encoding.

>
> If you need a way to remember the encoding of a file, an invisible and
> ambiguous character seems like a terrible choice.

Yes, it's the worst solution, except for all the others. I think it was
the best that the standards could do in the face of the proliferation
of other encodings in use :-)

> There are visible
> and explicit ways to do this in, as far as I know, any file that
> requires it.

This is fine for the primary use of the file, eg Asciidoc reads the
:encoding: attribute and obeys it, but any other tool is not going to
know to look for it, and then they would also have to know to look for
the markings in Emacs files and Vim files and XML and .....

The BOM and locale are to provide a way of indicating the encoding to
tools that do not parse the content but still need to
display/print/read it correctly.

>
> XML has encoding declarations, and programming languages which support
> mixed encodings have special comments that parsers, humans, and text
> editors understand. I've never had a program crash or produce
> incorrect output because of one of these comments. BOMs on the other
> hand...

All depends on your locale and tool set, in ASCII speaking countries
things are usually easier than in other countries. :-)

It's a painful artifact left over from the past unfortunately and as
such hasn't a single clean solution until the entire world is Unicode,
and even that isn't totally a solution (see the Han Unification wars
for details).

Cheers
Lex


ak_avenger

Oct 24, 2009, 3:29:37 AM
to asciidoc
> All depends on your locale and tool set, in ASCII speaking countries
> things are usually easier than in other countries. :-)

Hey now, I wouldn't have accidentally inserted a BOM by switching to
the wrong character set if I only spoke ASCII.

If you assume some non-UTF-8 encoding by default, and you have no
explicit declaration, I guess it makes sense to take into account
whatever information you do have, and assume UTF-8 if you see a BOM.
The order of priority you gave makes sense.

Lex Trotman

Oct 24, 2009, 4:21:49 AM
to asci...@googlegroups.com
2009/10/24 ak_avenger <ak_av...@hotmail.com>:
>
>> All depends on your locale and tool set, in ASCII speaking countries
>> things are usually easier than in other countries. :-)
>
> Hey now, I wouldn't have accidentally inserted a BOM by switching to
> the wrong character set if I only spoke ASCII.

Sorry, I meant me, not you. That's why I had to consult a European
friend to get a view of the situation in the non ASCII (English) part
of the world. :-)

Cheers
Lex

Yves-Alexis Perez

Oct 24, 2009, 6:46:02 PM
to asci...@googlegroups.com
On Fri, 2009-10-23 at 19:35 -0700, ak_avenger wrote:
> In fact, the only "text editor" I know of which displays BOMs, and
> therefore allows you to delete them, is NetBeans. So anytime I need to
> strip a BOM, it takes me a good five minutes to get started.

Next time, just use a hexadecimal editor like bvi :)
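
For anyone without a hex editor to hand, a short Python script is another option. This is a minimal sketch; `rewrite_without_bom` is a hypothetical helper name. It relies on Python's standard "utf-8-sig" codec, which drops a leading BOM on decode if one is present.

```python
def rewrite_without_bom(path):
    """Rewrite a UTF-8 text file in place without its leading BOM."""
    # "utf-8-sig" skips a leading BOM when reading; newline="" keeps the
    # file's existing line endings untouched in both directions.
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
```

Files without a BOM are rewritten unchanged, so it is safe to run over a mixed set of UTF-8 files.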

--
Yves-Alexis

Stuart Rackham

Oct 25, 2009, 1:33:02 AM
to asci...@googlegroups.com
Thanks for all the info Lex, there's a lot to this little BOM character.

Currently the AsciiDoc encoding attribute is only used to set the output markup
'encoding' attribute -- neither the input stream nor the output stream is
explicitly encoded/decoded. The only place that the input stream is decoded is
to calculate the length of the title underlines.
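
To show why the underline length needs a decode step, here is a small Python sketch (an illustration only, not AsciiDoc's code): for non-ASCII titles the number of bytes differs from the number of characters, so an underline compared against the byte count would appear to be the wrong length.

```python
title = "Caf\u00e9"                  # "Café": four characters
encoded = title.encode("utf-8")       # the "é" takes two bytes in UTF-8

print(len(encoded))                   # 5 bytes
print(len(encoded.decode("utf-8")))   # 4 characters
```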

I've made some notes to myself to address the question of explicitly setting the
input encoding but haven't had the motivation or the expertise to do anything.

Anyway, am I right in surmising the optimal immediate BOM solution is simply to
pass the BOM through to the output but not allow it to affect the first parsed
AsciiDoc element (usually the document title)?


Cheers, Stuart

Lex Trotman

Oct 25, 2009, 1:56:43 AM
to asci...@googlegroups.com
2009/10/25 Stuart Rackham <srac...@gmail.com>:
>
> Thanks for all the info Lex, there's a lot to this little BOM character.
>
> Currently the AsciiDoc encoding attribute is only used to set the output markup
> 'encoding' attribute -- neither the input stream nor the output stream is
> explicitly encoded/decoded. The only place that the input stream is decoded is
> to calculate the length of the title underlines.

Ok, I was maybe reading too much into the documentation of the
encoding option, or maybe just not reading it right :-)

>
> I've made some notes to myself to address the question of explicitly setting the
> input encoding but haven't had the motivation or the expertise to do anything.

Python does of course have a whole lot of tools to address the
problem, but, like the problem I guess, they're a bit complex :-)

>
> Anyway, am I right in surmising the optimal immediate BOM solution is simply to
> pass the BOM through to the output but not allow it to affect the first parsed
> AsciiDoc element (usually the document title)?

Yep, if nothing else the presence of a BOM should not invalidate
otherwise correct input.

Cheers
Lex

Stuart Rackham

Oct 26, 2009, 3:48:54 PM
to asci...@googlegroups.com
I've added UTF-8 BOM handling to the trunk: If an AsciiDoc document file begins
with a UTF-8 BOM (byte order mark) then it is passed transparently through to
the output file. The BOM is stripped from included files:

http://hg.sharesource.org/asciidoc/rev/7b4db93aff22
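
The behaviour described above could be sketched in Python along these lines. This is an illustration under assumptions, not the code from the linked changeset: the function `read_source` and its `included` flag are hypothetical names.

```python
import codecs


def read_source(data, included=False):
    """Split raw file bytes into (bom, lines).

    For the main document the BOM is returned separately so it can be
    written to the start of the output; for included files it is dropped.
    Either way the BOM never reaches the first parsed line.
    """
    bom = b""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
        if not included:
            bom = codecs.BOM_UTF8
    return bom, data.splitlines()


bom, lines = read_source(b"\xef\xbb\xbf= My document\n\nSome text.\n")
print(bom == codecs.BOM_UTF8, lines[0])
```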