But I again had a UTF-8 file (an IDoc from the SAP ERP software) with a BOM
as its first character (a hex editor shows the file starting with the
bytes ef bb bf).
When the file is opened with the utf-8 encoding, the first character read
is the BOM.
It is hard to recognize because it is invisible if you print it as a
string. Only something like "scan \ufeff %c" shows it.
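To make this visible, here is a minimal sketch that decodes the three BOM bytes the same way a utf-8 channel would. The variable names are only for illustration.

```tcl
# Build the three BOM bytes (ef bb bf) and decode them as UTF-8.
set bytes [binary format H* efbbbf]
set char  [encoding convertfrom utf-8 $bytes]

# The result is a single, invisible character: U+FEFF.
set code [scan $char %c]
puts "length: [string length $char]"
puts "code point: $code (hex [format %04x $code])"
```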
This is not a bug, it is a feature, but many programs are unaware of
this fact and do not process the file:
Example: tkxmllint (http://tclxml.sourceforge.net/tkxmllint.html) ->
fails with error "wrong character"
Are there any ideas on this?
Would it be helpful to have an encoding that automatically skips the
BOM?
The following sequence may help:
    if {[string index $Data 0] eq "\ufeff"} {
        set Data [string range $Data 1 end]
    }
Thank you,
Harald
It's application-dependent whether a BOM in UTF-8 is an error. On the
Windows platform it's the norm to use BOMs to mark Unicode files in
UTF-8 etc., so there one might expect a BOM to be detected and ignored,
but on other platforms it might simply be wrong.
It's easy enough to work around if you know what's up, especially
since Tcl can switch encodings in mid-stream without much trouble:
just open the file as binary, peek at the first bytes, and then decide
on the real encoding. IIRC tDOM has some code to do that for XML files
in its script library.
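A minimal sketch of that approach, assuming only the UTF-8 BOM needs to be handled. The proc name openSkippingBom and its fallback argument are made up for illustration; this is not the tDOM code.

```tcl
# Hypothetical helper: open a file, peek at the first three bytes in
# binary mode, and skip a UTF-8 BOM if one is present. Without a BOM
# the channel is rewound and the caller-supplied fallback encoding
# (an assumption, utf-8 by default) is used instead.
proc openSkippingBom {path {fallback utf-8}} {
    set chan [open $path r]
    fconfigure $chan -translation binary

    # Hex-encode the first bytes so the comparison is unambiguous.
    set hex ""
    binary scan [read $chan 3] H* hex

    if {$hex eq "efbbbf"} {
        # BOM found: it is already consumed, and the file is UTF-8.
        set enc utf-8
    } else {
        # No BOM: go back to the start of the file.
        seek $chan 0
        set enc $fallback
    }

    # Switch the channel from binary to text mode in mid-stream.
    fconfigure $chan -translation auto -encoding $enc
    return $chan
}
```

Typical use would be "set chan [openSkippingBom $file]" followed by the usual read and close; the data then no longer starts with \ufeff.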
Michael