My needs are very simple: I just want to parse a couple of XML files
that can be found in the manifest for EPUB files. I tried a parsing
routine I found at http://wiki.tcl.tk/3919, but it throws an error.
I've edited that page to show the input I'm using; the xml is so
simple I could probably use a regular expression, as all I need to do
is extract the value of the rootfile tag's full-path attribute. The
file corresponding to full-path however promises to be a somewhat more
substantial xml file (despite the .opf extension) which I will in turn
need to parse.
Thanks in advance. If I botched the formatting of my wiki edit a bit,
please feel free to correct it, or me!
The wiki is not the proper place for bug reports or calls for help.
This is the place.
So I'm copying your example back here:
> I tried to use this routine on a small xml file but got the following error:
>
> list element in quotes followed by "?" instead of space
> while executing
> "lrange [string map {= " "} $item] 1 end"
> (procedure "xml2list" line 25)
> invoked from within
> "xml2list $file_data"
>
> Input was:
> <?xml version="1.0"?> ...
Clearly the problem is that the code is not prepared to handle that
special prefix, and the "?" ends up where it shouldn't. A simple fix
is to just remove it beforehand:
regsub {^\s*<[?][^?]*[?]>} $input "" input
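A minimal sketch of that fix wrapped in a helper, so it can be applied before handing the data to xml2list. The proc name stripXmlDecl and the sample input are illustrative assumptions, not part of the wiki routine:

```tcl
# Remove a leading <?xml ... ?> declaration, if present, and any
# whitespace that follows it. Other processing instructions are untouched.
proc stripXmlDecl {input} {
    regsub {^\s*<[?][^?]*[?]>} $input "" input
    return [string trimleft $input]
}

# Hypothetical sample input, for illustration only.
set file_data {<?xml version="1.0"?>
<container version="1.0">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf"/>
  </rootfiles>
</container>}

set cleaned [stripXmlDecl $file_data]
puts [string range $cleaned 0 9]   ;# prints: <container
```

The cleaned string, rather than the raw file contents, would then be passed on to the parsing routine.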
-Alex
Thanks for your help both as to code and how to (not) use the wiki. I
will definitely try your suggested technique on the file indicated by
the full-path attribute: "OEBPS/content.opf"
In the meantime though I came up with a regex for finding that
attribute within the input; my Tcl/regex skills need some sharpening
anyway:
regexp {full-path=["'](.*?)["']} $input -> fullPath
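For reference, here is roughly how that pattern might be used in a small helper. The proc name rootfilePath and the sample container XML are assumptions made for the example; only the pattern itself and the OEBPS/content.opf value come from this thread:

```tcl
# Return the value of the first full-path attribute found in $xml,
# or an empty string if there is none. This is plain pattern matching,
# not XML parsing, so it only suits input as simple as this.
proc rootfilePath {xml} {
    if {[regexp {full-path=["'](.*?)["']} $xml -> path]} {
        return $path
    }
    return ""
}

# Hypothetical sample input, for illustration only.
set input {<?xml version="1.0"?>
<container version="1.0">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf"/>
  </rootfiles>
</container>}

puts [rootfilePath $input]   ;# prints: OEBPS/content.opf
```

The non-greedy (.*?) keeps the match from running on to a later quote on the same line.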
Comparing the output of the above XML as a list to its original string
content, I wonder if it might not be more useful to convert xml to a
(nested) dictionary?
The following:
package {xmlns http://www.idpf.org/2007/opf xmlns:dc http://purl.org/dc/elements/1.1/
unique-identifier bookid version 2.0} {{metadata {} {{dc:title {} {{{#text}
{Hello World: My First EPUB}}}} {dc:creator {} {{{#text} {My Name}}}}
{dc:identifier {id bookid} {{{#text} urn:uuid:12345}}}
{dc:language {} {{{#text} en-US}}} {meta {name cover content cover-image} {}}}}
{manifest {} {{item {id ncx href toc.ncx media-type text/xml} {}}
{item {id cover href title.html media-type application/xhtml+xml} {}}
{item {id content href content.html media-type application/xhtml+xml} {}}
{item {id cover-image href images/cover.png media-type image/png} {}}
{item {id css href stylesheet.css media-type text/css} {}}}}
{spine {toc ncx} {{itemref {idref cover linear no} {}}
{itemref {idref content} {}}}} {guide {} {{reference {href title.html type
cover title Cover} {}}}}}
Is what results from:
<?xml version='1.0' encoding='utf-8'?>
<package xmlns="http://www.idpf.org/2007/opf"
xmlns:dc="http://purl.org/dc/elements/1.1/"
unique-identifier="bookid" version="2.0">
<metadata>
<dc:title>Hello World: My First EPUB</dc:title>
<dc:creator>My Name</dc:creator>
<dc:identifier id="bookid">urn:uuid:12345</dc:identifier>
<dc:language>en-US</dc:language>
<meta name="cover" content="cover-image" />
</metadata>
<manifest>
<item id="ncx" href="toc.ncx" media-type="text/xml"/>
<item id="cover" href="title.html" media-type="application/xhtml+xml"/>
<item id="content" href="content.html" media-type="application/xhtml+xml"/>
<item id="cover-image" href="images/cover.png" media-type="image/png"/>
<item id="css" href="stylesheet.css" media-type="text/css"/>
</manifest>
<spine toc="ncx">
<itemref idref="cover" linear="no"/>
<itemref idref="content"/>
</spine>
<guide>
<reference href="title.html" type="cover" title="Cover"/>
</guide>
</package>
Or I guess I can use the nested list options to lsearch:
http://www.tcl.tk/man/tcl/TclCmd/lsearch.htm#M24
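As a rough sketch of the lsearch route, the nested list can be walked with -index 0 to pick a child element by tag name, and the manifest items collected into a dict along the way. The helper names child and manifestHrefs are made up for this example, the tree is an abbreviated copy of the list above, and Tcl 8.5 or later is assumed (for dict, lassign and lsearch -index):

```tcl
# Abbreviated tree in the {name attributes children} shape shown above.
set tree {package {version 2.0} {
    {manifest {} {
        {item {id ncx href toc.ncx media-type text/xml} {}}
        {item {id cover href title.html media-type application/xhtml+xml} {}}
        {item {id content href content.html media-type application/xhtml+xml} {}}
    }}
    {spine {toc ncx} {
        {itemref {idref cover linear no} {}}
        {itemref {idref content} {}}
    }}
}}

# Return the first child element of $node whose tag is $tag, or {} if none.
proc child {node tag} {
    return [lsearch -inline -exact -index 0 [lindex $node 2] $tag]
}

# Build a dict mapping each manifest item's id to its href.
proc manifestHrefs {tree} {
    set hrefs [dict create]
    foreach item [lindex [child $tree manifest] 2] {
        lassign $item name attrs
        if {$name eq "item"} {
            dict set hrefs [dict get $attrs id] [dict get $attrs href]
        }
    }
    return $hrefs
}

set hrefs [manifestHrefs $tree]
puts [dict get $hrefs content]   ;# prints: content.html
```

Since each attribute list is already an even-length name/value list, dict get works on it directly, so only the element lookup needs lsearch.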
The string map modification speeds things up slightly and removes one
more line of code!
Martyn
I think we've identified the difficulty.
I'll be flagrantly obvious: "simple XML ..." is an oxymoron.
Haha, but isn't there some software called "SAX": Simple API for XML?
Nevertheless, and please correct me if I'm wrong, XML handling seems to
be yet another area (besides OO) where there is quite a bit of
fragmentation within the Tcl community.
Yes. With any cool tool, random contraptions ranging from desperate
sh*t to jewels are doable. Turing completeness on one side, human
creativity on the other. So what?
-Alex
The wise people among us use tDOM, provided we don't need streaming.
Yes, it's not a pure Tcl XML parser, but it *is* very very good. (The
wise people among us also don't worry too much about doing everything
in Tcl, not when the real treasures come from using C and Tcl
together.)
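For comparison, a minimal sketch of the full-path lookup with tDOM. It assumes the tdom package is installed; the sample container XML is an assumption for illustration, and local-name() is used only to sidestep namespace prefixes in the XPath expression:

```tcl
package require tdom

# Hypothetical sample input, for illustration only.
set xml {<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>}

# Parse the document and locate the first rootfile element.
set doc  [dom parse $xml]
set root [$doc documentElement]
set node [lindex [$root selectNodes {//*[local-name()='rootfile']}] 0]

# Read the attribute, then free the DOM tree.
set fullPath [$node getAttribute full-path]
$doc delete

puts $fullPath   ;# prints: OEBPS/content.opf
```

The XML declaration needs no special handling here, since the parser accepts it as part of the document.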
Donal.
If my XML needs weren't so simple (I've already demonstrated above that
the first of the two steps I need to perform can be achieved with a
regex), and the application I'm trying to push forward didn't need to
run on multiple platforms, I too wouldn't care that tDOM isn't
written in pure Tcl.