How to display and remove BOM in utf-8 encoded file

Carlo Trimarchi

Aug 9, 2011, 7:37:34 AM

to vim...@googlegroups.com

Hi,
I developed a website with Vim, working both on linux and windows and
never had any problems. The other day someone else needed to edit some
files and tried to use Mac and Windows. Apparently in the files he
edited there is this Byte-Order Mark. I discovered this only via the
w3c validator that gave me this warning:

"Byte-Order Mark found in UTF-8 File. The Unicode Byte-Order Mark
(BOM) in UTF-8 encoded files is known to cause problems for some text
editors and older browsers. You may want to consider avoiding its use
until it is better supported."

The only way I could solve the problem was using notepad++ which has
an option to explicitly save the file without the BOM. Is there a way
to do the same thing in Vim? Maybe even to display this BOM?

Thanks,
Carlo

Neil Bird

Aug 9, 2011, 8:40:55 AM

to vim...@googlegroups.com

Around about 09/08/11 12:37, Carlo Trimarchi typed ...

> The only way I could solve the problem was using notepad++ which has
> an option to explicitly save the file without the BOM. Is there a way
> to do the same thing in Vim? Maybe even to display this BOM?

:set bomb?

Do ':set nobomb' before saving to remove a BOM.


Tony Mechelynck

Aug 9, 2011, 11:13:44 AM

to vim...@googlegroups.com, Carlo Trimarchi

On 09/08/11 13:37, Carlo Trimarchi wrote:
> Hi,
> I developed a website with Vim, working both on linux and windows and
> never had any problems. The other day someone else needed to edit some
> files and tried to use Mac and Windows. Apparently in the files he
> edited there is this Byte-Order Mark. I discovered this only via the
> w3c validator that gave me this warning:
>
> "Byte-Order Mark found in UTF-8 File. The Unicode Byte-Order Mark
> (BOM) in UTF-8 encoded files is known to cause problems for some text
> editors and older browsers. You may want to consider avoiding its use
> until it is better supported."

That message is outdated. The BOM is supported in all Unicode encodings
including UTF-8 by all "reasonably recent" browsers. It is also part of
the HTML standard. Some text editors (such as Notepad, I think) choke on
it, but the answer to that is to use a better editor, such as Vim or
even WordPad, which know about the BOM and handle it correctly, even in
UTF-8.

For some other kinds of text files (most source files and shell scripts,
for instance), it is better to save the file without a BOM, but for
most "web" formats including HTML, CSS, and, I think, XML, XHTML, etc.,
a BOM is no problem and can even be a help (e.g. in case the web server
sets the charset incorrectly or not at all in its Content-Type header).

>
> The only way I could solve the problem was using notepad++ which has
> an option to explicitly save the file without the BOM. Is there a way
> to do the same thing in Vim? Maybe even to display this BOM?
>
> Thanks,
> Carlo
>

To save the file without a BOM:

:setlocal nobomb
:w
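
Outside Vim, the same result can be had by dropping the three BOM bytes (EF BB BF) from the start of the file. The following is only a minimal Python sketch of that idea, not anything Vim does internally; `strip_utf8_bom` is a hypothetical helper name, and it assumes the file is small enough to read into memory:

```python
import codecs

def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM from the file at `path`.

    Returns True if a BOM was found and removed, False otherwise.
    (Hypothetical helper for illustration only.)
    """
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False  # nothing to do, file is left untouched
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True
```

Calling it a second time on the same file should return False, since the mark is already gone.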

To ask Vim if there is a BOM:

:setlocal bomb?

The answer is bomb for "BOM present" or nobomb for "BOM absent".

Note that regardless of the state of the 'bomb' option, a BOM can only
exist if the 'fileencoding' is one of UTF-8, UTF-16 (or its UCS-2
subset) or UTF-32 (aka UCS-4), in any endianness (endianness is not
relevant for UTF-8). For other 'fileencoding' values the 'bomb' option
is irrelevant.
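
To make the list of encodings concrete, here is a rough Python sketch that looks at the first bytes of a file and reports which BOM, if any, is present. `detect_bom` and the `BOMS` table are names made up for this illustration; the byte sequences come from Python's standard `codecs` module:

```python
import codecs

# Order matters: the UTF-32 LE mark begins with the same two bytes as the
# UTF-16 LE mark, so the longer marks are checked first.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

A file in any other encoding (Latin1, for instance) simply yields None here, which matches the point that 'bomb' is irrelevant for those.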

To display the presence or absence of the BOM on the status line:

see http://vim.wikia.com/wiki/Show_fileencoding_and_bomb_in_the_status_line

Best regards,
Tony.

Christian Brabandt

Aug 9, 2011, 12:11:27 PM

to vim...@googlegroups.com

On Tue, August 9, 2011 5:13 pm, Tony Mechelynck wrote:
> To save the file without a BOM:
>
> :setlocal nobomb
> :w

:w ++bin
should also work IIRC.

regards,
Christian

Carlo Trimarchi

Aug 9, 2011, 1:36:14 PM

to Tony Mechelynck, vim...@googlegroups.com

On 9 August 2011 17:13, Tony Mechelynck <antoine.m...@gmail.com> wrote:

> That message is outdated. The BOM is supported in all Unicode encodings
> including UTF-8 by all "reasonably recent" browsers. It is also part of the
> HTML standard.

Well, with the BOM the whole layout of the website appeared broken in
Internet Explorer 7. No problem with Firefox. Still, it seems this is not an
issue to underestimate.

> For some other kinds of text files (most source files and shell scripts, for
> instance), it is better to save the file without a BOM, but for most "web"
> formats including HTML, CSS, and, I think, XML, XHTML, etc., a BOM is no
> problem and can even be a help (e.g. in case the web server sets the charset
> incorrectly or not at all in its Content-Type header).

It was a PHP file, so maybe that's the problem.

> To save the file without a BOM:
>
> :setlocal nobomb
> :w
>
> To ask Vim if there is a BOM:
>
> :setlocal bomb?
>
> The answer is bomb for "BOM present" or nobomb for "BOM absent".
>
>

> To display the presence or absence of the BOM on the status line:
>
> see
> http://vim.wikia.com/wiki/Show_fileencoding_and_bomb_in_the_status_line

Thanks for all the info and the commands. Very useful.

Ben Fritz

Aug 9, 2011, 5:54:08 PM

to vim_use

On Aug 9, 10:13 am, Tony Mechelynck <antoine.mechely...@gmail.com>
wrote:

> On 09/08/11 13:37, Carlo Trimarchi wrote:
>
> > Hi,
> > I developed a website with Vim, working both on linux and windows and
> > never had any problems. The other day someone else needed to edit some
> > files and tried to use Mac and Windows. Apparently in the files he
> > edited there is this Byte-Order Mark. I discovered this only via the
> > w3c validator that gave me this warning:
>
> > "Byte-Order Mark found in UTF-8 File. The Unicode Byte-Order Mark
> > (BOM) in UTF-8 encoded files is known to cause problems for some text
> > editors and older browsers. You may want to consider avoiding its use
> > until it is better supported."
>
> That message is outdated. The BOM is supported in all Unicode encodings
> including UTF-8 by all "reasonably recent" browsers. It is also part of
> the HTML standard. Some text editors (such as Notepad, I think) choke on
> it, but the answer to that is to use a better editor, such as Vim or
> even WordPad, which know about the BOM and handle it correctly, even in
> UTF-8.
>

Not true. W3C still explicitly recommends against using a BOM for
UTF-8 (but I don't remember the link off-hand, sorry, I think it was
either in the HTML4.01 or HTML5 spec somewhere). Even modern browsers
like Firefox and Opera choke on a BOM in UTF-8 files for XHTML served
as XML. Using a BOM for UTF-8 on the internet is a bad idea.

A BOM is however recommended and useful on UTF-16 or UTF-32 and the
like.

pansz

Aug 9, 2011, 8:18:43 PM

to vim...@googlegroups.com

On Tue, Aug 9, 2011 at 11:13 PM, Tony Mechelynck
<antoine.m...@gmail.com> wrote:
>
> That message is outdated. The BOM is supported in all Unicode encodings
> including UTF-8 by all "reasonably recent" browsers. It is also part of the
> HTML standard.

BOM is a standard for UCS2 or UTF-16, not for UTF-8.

BOM for utf-8 will cause problems for most programs which expect text
streams. gcc is a good example, most GNU CLI utilities will reject
utf-8 with BOM.

And, W3C validator will of course complain about it...

Tony Mechelynck

Aug 10, 2011, 7:19:30 AM

to vim...@googlegroups.com, pansz

On 10/08/11 02:18, pansz wrote:
> On Tue, Aug 9, 2011 at 11:13 PM, Tony Mechelynck
> <antoine.m...@gmail.com> wrote:
>>
>> That message is outdated. The BOM is supported in all Unicode encodings
>> including UTF-8 by all "reasonably recent" browsers. It is also part of the
>> HTML standard.
>
> BOM is a standard for UCS2 or UTF-16, not for UTF-8.

According to the Unicode FAQ,
http://www.unicode.org/faq//utf_bom.html#bom4 (two successive FAQ
questions) a BOM can be used in UTF-8 as well as in UTF-16 or UTF-32;
but since UTF-8 doesn't have endianness variants, with UTF-8 it
specifies encoding only, not endianness. BTW, "good" editors (including
at least Vim and WordPad, possibly others) handle the BOM correctly,
even in UTF-8. In fact, in my experience WordPad won't read UTF-8 text
correctly _unless_ there is a BOM.

However (about your next paragraph), when UTF-8 is fed "transparently"
to a program which expects ASCII, and in particular to any program which
expects #! at the start of a file, the BOM should not be used (see the
2nd FAQ question linked above, and also
http://www.unicode.org/faq//utf_bom.html#bom10 "How should I deal with
BOMs?", point 3).

>
> BOM for utf-8 will cause problems for most programs which expect text
> streams. gcc is a good example, most GNU CLI utilities will reject
> utf-8 with BOM.

I explicitly mentioned in the part you snipped that for some other kinds
of text than HTML or CSS (such as, I said, source files and shell
scripts) it is better to save the file without a BOM.

>
> And, W3C validator will of course complain about it...
>

...with a warning, not an error; and Tidy won't.

Best regards,
Tony.

Ben Fritz

Aug 10, 2011, 11:30:38 AM

to vim_use

On Aug 10, 6:19 am, Tony Mechelynck <antoine.mechely...@gmail.com>
wrote:

> On 10/08/11 02:18, pansz wrote:
>
> > On Tue, Aug 9, 2011 at 11:13 PM, Tony Mechelynck
> > <antoine.mechely...@gmail.com> wrote:
>
> >> That message is outdated. The BOM is supported in all Unicode encodings
> >> including UTF-8 by all "reasonably recent" browsers. It is also part of the
> >> HTML standard.
>
> > BOM is a standard for UCS2 or UTF-16, not for UTF-8.
>
> According to the Unicode FAQ,
> http://www.unicode.org/faq//utf_bom.html#bom4 (two successive FAQ
> questions) a BOM can be used in UTF-8 as well as in UTF-16 or UTF-32;
> but since UTF-8 doesn't have endianness variants, with UTF-8 it
> specifies encoding only, not endianness. BTW, "good" editors (including
> at least Vim and WordPad, possibly others) handle the BOM correctly,
> even in UTF-8. In fact, in my experience WordPad won't read UTF-8 text
> correctly _unless_ there is a BOM.
>
> However (about your next paragraph), when UTF-8 is fed "transparently"
> to a program which expects ASCII, and in particular to any program which
> expects #! at the start of a file, the BOM should not be used (see the
> 2nd FAQ question linked above, and also
> http://www.unicode.org/faq//utf_bom.html#bom10 "How should I deal with
> BOMs?", point 3).
>
> > BOM for utf-8 will cause problems for most programs which expect text
> > streams. gcc is a good example, most GNU CLI utilities will reject
> > utf-8 with BOM.
>
> I explicitly mentioned in the part you snipped that for some other kinds
> of text than HTML or CSS (such as, I said, source files and shell
> scripts) it is better to save the file without a BOM.
>
> > And, W3C validator will of course complain about it...
>
> ...with a warning, not an error; and Tidy won't.
>

W3C specifically recommends you do NOT use a BOM for UTF-8 on HTML/
XHTML/CSS documents. See http://www.w3.org/International/questions/qa-byte-order-mark#bomhow

While developing TOhtml, I ran into problems in some browsers when
using UTF-8 with BOM. If I remember correctly, browsers which actually
handle XHTML correctly, like Opera and Firefox, were interpreting the
BOM as characters appearing before the XML prolog <?xml..., which
makes the XML not well-formed, and therefore (somewhat correctly)
the browser bailed without rendering anything. Re-parsing the document
as HTML of course may allow these browsers to render the document
correctly, but according to the W3C link above, some user agents will
still have problems and attempt to render characters instead of
treating it as an invisible BOM.
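
The "characters before the prolog" effect is easy to reproduce outside a browser. The sketch below is only an illustration in Python with made-up sample bytes, not a model of what any particular browser does: a plain UTF-8 decode keeps the BOM as the character U+FEFF, while Python's "utf-8-sig" codec consumes it.

```python
# Bytes of a tiny XHTML document saved as UTF-8 with a BOM (sample data).
raw = b'\xef\xbb\xbf<?xml version="1.0" encoding="UTF-8"?><html/>'

# A plain UTF-8 decode keeps the BOM as the character U+FEFF, so the
# text no longer begins with the XML declaration.
plain = raw.decode("utf-8")
bom_kept = plain[0] == "\ufeff"                        # True
prolog_first = plain.startswith("<?xml")               # False

# The "utf-8-sig" codec consumes a leading BOM if one is present.
stripped = raw.decode("utf-8-sig")
prolog_first_stripped = stripped.startswith("<?xml")   # True
```

A consumer that does the first kind of decode and then insists on `<?xml` being the very first thing in the document will see the file as malformed.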

For this reason, syntax/2html contains (after opening the buffer for
the generated file):

" According to http://www.w3.org/TR/html4/charset.html#doc-char-set, the byte
" order mark is highly recommend on the web when using multibyte encodings. But,
" it is not a good idea to include it on UTF-8 files. Otherwise, let Vim
" determine when it is actually inserted.
if s:settings.vim_encoding == 'utf-8'
  setlocal nobomb
else
  setlocal bomb
endif

Alessandro Antonello

Aug 10, 2011, 6:03:36 PM

to vim...@googlegroups.com

May I add an observation to this discussion?

The best way to use a BOM is to know your target. I work on a MacBook,
which has UTF-8 as the default. When I'm working with Objective-C that will be
compiled using LLVM, there is no problem using a BOM (which is a good thing, since
the encoding can be easily recognized). But when I'm working with Java, doing
something for the Android platform, I use ISO-8859-1 because the Google guys
defined the 'encoding' argument of the 'javac' compiler as 'ASCII' in an
Ant XML file somewhere.

I also know that PHP doesn't handle the BOM well, so I decided to work with PHP
in ISO-8859-1 too. But my e-mails are all HTML formatted using UTF-8 with BOM
(edited in Vim), and always viewed in Firefox, Safari or Chrome with no problems.

I believe that the problem with major browsers has to do with user
configuration. You can let the browser detect the character set of a page,
or configure it to use one based on the assumption that you are in a
Western country (or another part of the world). This causes no problems if
you don't open pages from other countries. These days, it is
preferable to let the browser handle the encoding itself.

Regards.

Tony Mechelynck

Aug 13, 2011, 4:49:14 PM

to vim...@googlegroups.com, Alessandro Antonello

Yeah, the idea is to know what your file will be used with.

Recently I discovered that when feeding a local *.txt file to SeaMonkey
(or, I suppose, Firefox), it will try to read it as Latin1 unless there
is a BOM. I'm not sure if that depends on my Appearance preferences. Of
course, for a *.txt on my local disk there is no metadata (no HTTP
headers etc.) to tell the MIME type and the encoding to the browser. For
the MIME type, *.txt means text/plain but it could be any charset.
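
What a Latin1 guess does to UTF-8 text can be shown with a few bytes. This is just a small Python illustration with an arbitrary sample word; it says nothing about how SeaMonkey or Firefox actually choose a charset:

```python
import codecs

text = "café"                    # sample text, chosen for illustration
data = text.encode("utf-8")      # bytes as stored on disk, without a BOM

# A reader that guesses Latin1 turns the two-byte sequence for "é"
# into two separate characters.
misread = data.decode("latin-1")         # "cafÃ©"

# With a BOM in front, a BOM-aware reader can pick UTF-8 and drop the mark.
with_bom = codecs.BOM_UTF8 + data
reread = with_bom.decode("utf-8-sig")    # "café"
```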

This means that when I want to display (and possibly print) multilingual
text (let's say, who knows? maybe a *.txt file in French with some
Russian and some Hebrew in it), something Gecko (the display engine used
by Firefox, Thunderbird and SeaMonkey) does better than gvim, I'll have
to record it with a BOM.

OTOH any file starting with #! MUST, as has already been said, be
recorded with no BOM because the shebang is only looked for in the first
two bytes of the file (which would be part of the BOM if there were one).
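
A quick way to see this, sketched in Python under the assumption that the check really is "are the first two bytes #!" as described above (`starts_with_shebang` is a hypothetical helper, not a real kernel or shell API):

```python
import codecs

script = b"#!/bin/sh\necho hello\n"     # sample script, no BOM
with_bom = codecs.BOM_UTF8 + script     # same script saved with a UTF-8 BOM

def starts_with_shebang(data: bytes) -> bool:
    # Mirrors the check described above: only the first two bytes count.
    return data[:2] == b"#!"
```

With the BOM in place the first two bytes are EF BB, so `starts_with_shebang(with_bom)` is False even though the text looks identical in a BOM-aware editor.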

Best regards,
Tony.
