Perhaps this is already solved ... and if so ... please let me know!
I have foreign XML files to parse and those files can, but need not,
have the BOM and non-UTF-8 encodings.
The problem with one UTF-16 encoded XML file is:
1. I opened the file:
set fd [open $fileName r];
2. I set the file pointer to the first character after the BOM:
seek $fd 2 start
3. I wanted dom to parse the file contents:
set doc [dom parse -channel $fd];
This failed with the error: "not well-formed (invalid token)" at
line 1 character 1.
I understand this, because the parser may get "< ? x m l ..." as its
first token.
4. I read the file with the encoding unicode:
seek $fd 2
fconfigure $fd -encoding unicode
set doc [dom parse -channel $fd]
This failed with the error: "encoding specified in XML declaration is
incorrect" at line 1 character 30
I understand this error too: once the channel is configured, reading
it no longer returns UTF-16 data but normal UTF-8 data, while the XML
declaration still says UTF-16.
So ... how to read/parse a foreign XML file?
Detect the BOM/encoding and fconfigure the channel, read the channel,
cut off the BOM and the XML declaration and let dom parse the rest?
What's the best, most common practice?
And the second question is ... how to create the XML declaration with
(t)dom?
If I serialize my dom tree via "$doc asXML", then I don't get an XML
declaration!
Do I have to write my own to the output channel:
set fd [open ... w]
puts $fd "<?xml ...?>"
$doc asXML -channel $fd
close $fd
Thanks in advance for any hint, suggestion, etc.!
Best regards,
Martin
Michael
Michael is right, as usual, but does a bit too much python atm
;-). tDOM::xmlOpenFile and tDOM::xmlReadFile are your friends.
tDOM::xmlOpenFile expects a filename and returns a file channel
handle, which is already fconfigure'd and seek'ed, ready to be fed
into a dom parse -channel ... Please note that the proc opens a
channel and returns it. That channel will not magically go away when
you're done with it. It's your responsibility to close the channel
once you no longer need it. So, a typical use pattern (sure, not the
only one) is
set xmlfd [tDOM::xmlOpenFile $filename]
set doc [dom parse -channel $xmlfd]
close $xmlfd
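If the parse fails, the close in the pattern above is never reached. A slightly more defensive variant could look like the sketch below. It assumes Tcl 8.6 or later (for try/finally), and the proc name parseXmlFile is made up for this illustration, it is not part of tDOM.

```tcl
package require tdom

# Hypothetical wrapper (not part of tDOM): parse a file through
# tDOM::xmlOpenFile and close the channel even if parsing fails.
proc parseXmlFile {fileName} {
    set xmlfd [tDOM::xmlOpenFile $fileName]
    try {
        # The channel is already configured for the detected encoding.
        return [dom parse -channel $xmlfd]
    } finally {
        # Runs on success and on error, so the channel never leaks.
        close $xmlfd
    }
}
```

The caller still owns the returned document and should call "$doc delete" when done with it.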
tDOM::xmlReadFile is just a wrapper around tDOM::xmlOpenFile. The
pattern is
set doc [dom parse [tDOM::xmlReadFile $filename]]
and you're done. No leaking file channels, filename in, DOM tree out.
There's more to it. Both are Tcl procs, part of the tdom.tcl lib,
which is installed together with tdom. Take a look at it; it deals
with exactly the problem you describe.
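To give an idea of the kind of check involved, here is a minimal sketch of BOM sniffing in plain Tcl. It is not the code from tdom.tcl; the proc name sniffBom and the returned labels are assumptions made for this example, and it only knows the UTF-8 and UTF-16 BOMs (a UTF-32 BOM is not handled).

```tcl
# Hypothetical helper (not part of tDOM): report which BOM, if any,
# a file starts with. Returns a two-element list: a label for the
# detected encoding and the length of the BOM in bytes.
proc sniffBom {fileName} {
    set fd [open $fileName r]
    fconfigure $fd -translation binary
    set head [read $fd 3]
    close $fd

    # Turn the leading bytes into a lowercase hex string for matching.
    set hex ""
    binary scan $head H* hex

    if {[string match efbbbf* $hex]} { return [list utf-8 3] }
    if {[string match fffe* $hex]}   { return [list utf-16le 2] }
    if {[string match feff* $hex]}   { return [list utf-16be 2] }
    # No BOM found: the caller has to fall back to the XML declaration.
    return [list {} 0]
}
```

A caller could use the second element as the offset for seek, which is what the manual "seek $fd 2" in the question did by hand.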
>And the second question is ... how to create the XML declaration with
>(t)dom?
>
>If I serialize my dom tree via "$doc asXML", then I don't get an XML
>declaration!
Sure.
>Do I have to write my own to the output channel:
>
> set fd [open ... w]
> puts $fd "<?xml ...?>"
> $doc asXML -channel $fd
> close $fd
Of course. If you also want to preserve the encoding that was found,
you can do something like
set doc [dom parse [tDOM::xmlReadFile $filename encStr]]
set outfd [open $outfilename w+]
fconfigure $outfd -encoding [tDOM::IANAEncoding2TclEncoding $encStr]
puts $outfd "<?xml version='1.0' encoding='$encStr'?>"
puts $outfd [$doc asXML]
close $outfd
$doc delete
rolf
Thanks for your detailed explanation!
The only thing I'm still missing is writing the BOM at the beginning
of a UTF-16 XML file that was read and is then written again as UTF-16.
One of the files I want to read has the BOM and the UTF-16 encoding.
Writing the file with the Tcl example you provided produces a correct
UTF-16 XML file, but without the BOM, so other editors have problems
with the UTF-16 text.
I'll take a look at the Tcl wiki to see whether there is already code
to create a BOM.
But perhaps a proc tdom::xmlWriteFile would be quite nice, which could
do everything in your Tcl example plus the leading BOM.
Best regards,
Martin
A BOM (Byte Order Mark) is just the Unicode character U+FEFF. As
U+FFFE (which you would get when reading 16-bit words in the wrong
byte order) is not a valid Unicode character, this marks files in the
right byte order. (A reader program could swap the bytes if it runs
into FFFE - see http://wiki.tcl.tk/517
).
And here's the magic for writing a BOM (after opening a file for
writing):
puts -nonewline $outfd \uFEFF
:^)
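Putting the pieces of this thread together, the writer proc Martin asked for could be sketched roughly as follows. The name writeXmlFile and its parameters are assumptions made for this example (no such proc is shown anywhere in the thread), and it takes the serialized XML as a string so it works with any producer. Note that Tcl's "unicode" encoding writes UTF-16 in the machine's native byte order, so the BOM comes out in that order as well.

```tcl
# Hypothetical helper (not part of tDOM): write an XML declaration and
# the serialized document, preceded by a BOM for UTF-16 output.
#   ianaEnc  encoding name for the XML declaration, e.g. UTF-16
#   tclEnc   matching Tcl encoding name, e.g. unicode
proc writeXmlFile {fileName xmlText ianaEnc tclEnc} {
    set fd [open $fileName w]
    try {
        fconfigure $fd -encoding $tclEnc
        # Only UTF-16 output gets a BOM in this sketch.
        if {[string match -nocase utf-16* $ianaEnc]} {
            puts -nonewline $fd \uFEFF
        }
        puts $fd "<?xml version='1.0' encoding='$ianaEnc'?>"
        puts $fd $xmlText
    } finally {
        close $fd
    }
}
```

With tDOM, a call might look like: writeXmlFile $outfilename [$doc asXML] $encStr [tDOM::IANAEncoding2TclEncoding $encStr]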