Skipping UTF8-BOM

Harald Oehlmann

Jan 22, 2009, 4:17:00 AM

A Unicode file may start with a byte order mark (BOM), the code point
U+FEFF. The same code in the wrong byte order (U+FFFE) is not a valid
character, so the byte order can be detected.
IMHO this only makes sense for UTF-16 encodings.

But I again had a UTF-8 file (an IDOC from SAP ERP software) with a BOM
as its first character (in a hex editor the file starts with the bytes
ef bb bf).
When the file is opened with utf-8 encoding, the first character read
is the BOM.
It is hard to spot because it is invisible when printed as a string.
Only something like "scan \ufeff %c" shows it.
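
To make this visible, here is a small sketch that decodes the three bytes from the hex dump and inspects the result. It simulates the file content with a literal byte string instead of reading a real file, which is an assumption made for illustration:

```tcl
# Simulate the first three bytes of the file and decode them as UTF-8.
set bytes "\xef\xbb\xbf"
set text [encoding convertfrom utf-8 $bytes]

# The decoded string is one character long, though nothing visible prints.
puts [string length $text]

# scan with %c yields the code point as an integer; show it in hex.
puts [format %04x [scan $text %c]]
```

This should print 1 and feff, which confirms that the decoder hands the BOM through as an ordinary character.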

This is not a bug, it is a feature, but many programs are unaware of
it and fail to process such a file.
Example: tkxmllint (http://tclxml.sourceforge.net/tkxmllint.html)
fails with the error "wrong character".

Are there any ideas on this?

Would it be helpful to have an encoding that automatically skips the
BOM?

The following sequence may help:
if {[string index $Data 0] eq "\ufeff"} {
    set Data [string range $Data 1 end]
}
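
As a minimal sketch, the same check could be wrapped in a small proc so it can be applied to whatever was read from the file. The name stripBom is made up for illustration and is not part of Tcl:

```tcl
# Hypothetical helper: return $data without a leading BOM (U+FEFF).
proc stripBom {data} {
    if {[string index $data 0] eq "\ufeff"} {
        return [string range $data 1 end]
    }
    return $data
}

# Usage: one string that starts with a BOM, and one that does not.
puts [string length [stripBom "\ufeffabc"]]
puts [string length [stripBom "abc"]]
```

Both calls should report a length of 3, since input without a BOM is returned unchanged.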

Thank you,
Harald

schlenk

Jan 22, 2009, 5:06:28 AM

Whether a BOM in a UTF-8 file is an error depends on the application.
On the Windows platform it is the norm to use BOMs to mark Unicode
files (UTF-8 etc.), so there one might expect a BOM to be detected and
ignored, but on other platforms it might simply be wrong.

It's easy enough to work around once you know what's up, especially
since Tcl can switch encodings in mid-stream without much trouble:
just open the file as binary, peek at the first bytes, and then decide
on the real encoding. IIRC tdom has some code to do that for XML files
in its script library.
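
A rough sketch of that idea, handling only the UTF-8 BOM: the proc name openSkippingBom and its fallback parameter are invented for illustration, and this is not the tdom code. UTF-16 BOMs (ff fe / fe ff) could be checked the same way, but the matching encoding names depend on the Tcl version, so they are left out here.

```tcl
# Hypothetical helper: open a file, skip a UTF-8 BOM if present, and
# return a channel positioned at the first real character.
proc openSkippingBom {path {fallback utf-8}} {
    set chan [open $path r]
    # Read raw bytes first: no decoding and no EOL translation yet.
    fconfigure $chan -translation binary
    set head [read $chan 3]
    if {$head eq "\xef\xbb\xbf"} {
        # UTF-8 BOM found: stay positioned right after it.
        set enc utf-8
    } else {
        # No BOM: rewind so the peeked bytes are read again as text.
        seek $chan 0 start
        set enc $fallback
    }
    # Switch the channel from binary to text in mid-stream.
    fconfigure $chan -encoding $enc -translation auto
    return $chan
}
```

The caller then uses the returned channel with read or gets as usual and closes it when done. A file without a BOM is decoded with the fallback encoding, which defaults to utf-8 in this sketch.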

Michael