tdom again - BOM and/or encoding problems & how to create <?xml ...?> node?


MartinLemburg@Siemens-PLM

May 22, 2008, 6:33:08 AM
Hi,

Perhaps this is already solved ... and if so, please let me know!

I have foreign XML files to parse and those files can, but need not,
have the BOM and non-UTF-8 encodings.

The problem with one UTF-16 encoded XML file is:

1. I opened the file:

set fd [open $fileName r];

2. I set the file pointer to the first character after the BOM:

seek $fd 2 start

3. I wanted dom to parse the file contents:

set doc [dom parse -channel $fd];

This failed with the error: "not well-formed (invalid token)" at
line 1 character 1.

I understand this, because the parser may see "< ? x m l ..." as the
first token.

4. I read the file with the encoding unicode:

seek $fd 2
fconfigure $fd -encoding unicode
set doc [dom parse -channel $fd]

This failed with the error: "encoding specified in XML declaration is
incorrect" at line 1 character 30
I understand this error, because reading the channel does not return
UTF-16 data, but normal UTF-8 data, but the encoding in the XML
declaration is UTF-16.
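
To make the byte layout behind step 3 concrete, here is a small sketch in plain Tcl (no tdom needed). It only shows what "<?xml" looks like as raw 16-bit data; it does not claim to reproduce the exact parser failure above. The order of the two bytes in each pair depends on the platform, because "unicode" is Tcl's native-byte-order 16-bit encoding.

```tcl
# Encode the first five characters of an XML declaration as 16-bit Unicode.
set bytes [encoding convertto unicode "<?xml"]

# Dump the raw bytes as hex digits; every ASCII character occupies two bytes,
# one of which is 00.
binary scan $bytes H* hex
puts "[string length $bytes] bytes: $hex"
```

Five characters become ten bytes, so anything that reads the channel byte by byte sees a zero byte next to every ASCII character.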

So ... how to read/parse a foreign XML file?

Detect the BOM/encoding and fconfigure the channel, read the channel,
cut off the BOM and the XML declaration and let dom parse the rest?

What's the best, most practised way?

And the second question is ... how to create the XML declaration with
(t)dom?

If I serialize my dom tree via "$doc asXML", then I don't get an XML
declaration!
Do I have to write my own to the output channel:

set fd [open ... w]
puts $fd "<?xml ...?>"
$doc asXML -channel $fd
close $fd

Thanks in advance for any hint, suggestion, etc.!

Best regards,

Martin

schlenk

May 22, 2008, 6:49:30 AM
MartinLemburg@Siemens-PLM wrote:
> Hi,
>
> Perhaps this is already solved ... and if so, please let me know!
>
> I have foreign XML files to parse and those files can, but need not,
> have the BOM and non-UTF-8 encodings.
>
you want:
package require tdom
set chan [tDOM::xmlOpenFile("file.xml")]
set doc [dom parse -channel $chan]

Michael

Rolf Ade

May 27, 2008, 6:45:01 PM
MartinLemburg@Siemens-PLM wrote:
>
>I have foreign XML files to parse and those files can, but need not,
>have the BOM and non-UTF-8 encodings.

Michael is right, as usual, but does a bit too much Python atm
;-). tDOM::xmlOpenFile and tDOM::xmlReadFile are your friends.

tDOM::xmlOpenFile expects a filename and returns a file channel
handle that is already fconfigure'd and seek'ed, ready to be fed into
dom parse -channel ... Please note that the proc opens a channel and
returns it. That channel will not magically go away when you're done
with it. It's your responsibility to close the channel once you no
longer need it. So, a typical usage pattern (certainly not the only one) is

set xmlfd [tDOM::xmlOpenFile $filename]
set doc [dom parse -channel $xmlfd]
close $xmlfd

tDOM::xmlReadFile is just a wrapper around tDOM::xmlOpenFile. The
pattern is

set doc [dom parse [tDOM::xmlReadFile $filename]]

and you're done. No leaking file channels, filename in, DOM tree out.

There's more to it. Both are Tcl procs, part of the tdom.tcl lib,
which is installed together with tdom. Take a look at it; it deals
with exactly the problem you describe.
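
For readers who just want the idea of BOM sniffing, here is a rough sketch in plain Tcl. It is not the code from tdom.tcl; detectBom is a hypothetical helper name, and the labels it returns are only descriptive, not necessarily valid Tcl encoding names. It handles just the UTF-8 and UTF-16 marks.

```tcl
# Hypothetical helper (not part of tDOM): look at the first bytes of a file
# and return a two-element list {label bomLength}.
proc detectBom {path} {
    set fd [open $path r]
    fconfigure $fd -translation binary
    set head [read $fd 3]
    close $fd

    # Turn the leading bytes into lowercase hex digits for easy matching.
    set hex ""
    binary scan $head H* hex

    if {[string match efbbbf* $hex]} { return {utf-8 3} }
    # Limitation of this sketch: a UTF-32LE BOM (ff fe 00 00) also matches here.
    if {[string match fffe* $hex]} { return {utf-16le 2} }
    if {[string match feff* $hex]} { return {utf-16be 2} }
    return {unknown 0}
}
```

A caller could use the second element as the offset for seek and the first to choose a channel encoding. Files without a BOM still need a look at the encoding given in the XML declaration.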

>And the second question is ... how to create the XML declaration with
>(t)dom?
>
>If I serialize my dom tree via "$doc asXML", then I don't get an XML
>declaration!

Sure.

>Have I to write my own to the output channel:
>
> set fd [open ... w]
> puts $fd "<?xml ...?>"
> $doc asXML -channel $fd
> close $fd

Of course. If you also want to preserve the encoding that was found,
you can do something like

set doc [dom parse [tDOM::xmlReadFile $filename encStr]]
set outfd [open $outfilename w+]
fconfigure $outfd -encoding [tDOM::IANAEncoding2TclEncoding $encStr]
puts $outfd "<?xml version='1.0' encoding='$encStr'?>"
puts $outfd [$doc asXML]
close $outfd
$doc delete

rolf

MartinLemburg@Siemens-PLM

May 28, 2008, 4:28:46 AM
Hi Rolf,

thanks for your detailed explanation!

The only thing I still miss is the creation of the BOM at the beginning
of a UTF-16 XML file that was read and is then written again as UTF-16.

One of the files I want to read has the BOM and the UTF-16 encoding.
Writing the file with the Tcl example you provided writes a correct
UTF-16 XML file, but without the BOM, so that other editors have
problems with the UTF-16 text.

I'll take a look at the Tcl wiki to see if there is already code to
create a BOM.

But perhaps a proc tdom::xmlWriteFile would be quite nice, one that
does everything in your Tcl example and also writes the leading BOM.

Best regards,

Martin

suchenwi

May 28, 2008, 10:23:47 AM
On 28 May, 10:28, "MartinLemburg@Siemens-PLM" <martin.lemburg.siemens-...@gmx.net> wrote:
> I'll take a look at the tcl wiki, if there is already code to create a
> BOM.

A BOM (Byte Order Mark) is just the Unicode character U+FEFF. As FFFE
(which you would get when reading 16-bit words in the wrong byte order)
is not a valid Unicode character, this marks files in the right byte
order. (A reader program could swap the bytes if it runs into FFFE; see
http://wiki.tcl.tk/517 .)

And here's the magic how to write a BOM (after opening a file for
writing):

puts -nonewline $outfd \uFEFF

:^)
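
Putting the BOM line together with the write pattern from earlier in the thread, a minimal sketch could look like this. writeUtf16WithBom is a hypothetical helper name, not a tDOM command, and the sketch assumes the output should always be 16-bit Unicode in the platform's native byte order (Tcl's "unicode" encoding) instead of mapping the detected encoding via tDOM::IANAEncoding2TclEncoding.

```tcl
# Hypothetical helper (not part of tDOM): write $text to $path as 16-bit
# Unicode with a leading byte order mark.
proc writeUtf16WithBom {path text} {
    set fd [open $path w]
    # Assumption: "unicode" selects native-byte-order 16-bit output.
    fconfigure $fd -encoding unicode -translation lf
    # The BOM goes through the channel encoding like any other character,
    # so it comes out in the same byte order as the rest of the file.
    puts -nonewline $fd \uFEFF
    puts -nonewline $fd $text
    close $fd
}
```

With tdom, the text argument could be built as "<?xml version='1.0' encoding='UTF-16'?>\n[$doc asXML]", following the earlier example.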
