
Where can I find a simple XML parser in pure Tcl


jemptymethod

Jan 5, 2011, 4:43:02 PM
I'd like to find a simple XML parser in pure Tcl, for instance a
single file I can drop into the lib directory of a starkit. I've
searched of course, but libraries I've found such as tcldom and some
others seem to be dependent on native code, correct me if I'm wrong?

My needs are very simple, I just want to parse a couple of xml files
that can be found in the manifest for epub files. I have actually
tried a parsing routine I found at http://wiki.tcl.tk/3919 but it
throws an error.

I've edited that page to show the input I'm using; the xml is so
simple I could probably use a regular expression, as all I need to do
is extract the value of the rootfile tag's full-path attribute. The
file corresponding to full-path however promises to be a somewhat more
substantial xml file (despite the .opf extension) which I will in turn
need to parse.

Thanks in advance. If I botched the formatting a bit of my wiki edit
please feel free to correct it, or me!

Alexandre Ferrieux

Jan 5, 2011, 5:03:28 PM
On Jan 5, 10:43 pm, jemptymethod <jemptymet...@gmail.com> wrote:
> I'd like to find a simple XML parser in pure Tcl, for instance a
> single file I can drop into the lib directory of a starkit.  I've
> searched of course, but libraries I've found such as tcldom and some
> others seem to be dependent on native code, correct me if I'm wrong?
>
> My needs are very simple, I just want to parse a couple of xml files
> that can be found in the manifest for epub files.  I have actually
> tried a parsing routine I found at http://wiki.tcl.tk/3919 but it
> throws an error.
>
> I've edited that page to show the input I'm using; the xml is so
> simple I could probably use a regular expression, as all I need to do
> is extract the value of the rootfile tag's full-path attribute.  The
> file corresponding to full-path however promises to be a somewhat more
> substantial xml file (despite the .opf extension) which I will in turn
> need to parse.
>
> Thanks in advance.  If I botched the formatting a bit of my wiki edit
> please feel free to correct it, or me!

The wiki is not the proper place for bugreports or calls for help.
Here is the place.
So I'm copying your example back here:

> I tried to use this routine on a small xml file but got the following error:
>
> list element in quotes followed by "?" instead of space
> while executing
> "lrange [string map {= " "} $item] 1 end"
> (procedure "xml2list" line 25)
> invoked from within
> "xml2list $file_data"
>
> Input was:
> <?xml version="1.0"?> ...

Clearly the problem is that the code is not prepared to handle that
special prefix, and the "?" ends up where it shouldn't. A simple fix
is to just remove it beforehand:

regsub {^\s*<[?][^?]*[?]>} $input "" input

-Alex
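
[Editorial sketch] The fix above can be tried in isolation. The snippet below is a minimal illustration, not part of the wiki routine: the sample XML is made up, and the `removed` variable and the extra `string trim` are assumptions added for the demonstration.

```tcl
# Hypothetical input: an XML declaration followed by a root element.
set input {<?xml version="1.0"?>
<container><rootfiles/></container>}

# Remove a leading <?...?> declaration. regsub returns the number of
# substitutions made and stores the result back in "input".
set removed [regsub {^\s*<[?][^?]*[?]>} $input "" input]

# Drop the newline left behind by the removed declaration.
set input [string trim $input]

puts $removed   ;# 1
puts $input     ;# <container><rootfiles/></container>
```

Because the pattern is anchored with `^`, only a declaration at the very start of the input is removed; processing instructions further into the document are left alone.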

jemptymethod

Jan 5, 2011, 5:46:29 PM
On Jan 5, 5:03 pm, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:

Thanks for your help both as to code and how to (not) use the wiki. I
will definitely try your suggested technique on the file indicated by
the full-path attribute: "OEBPS/content.opf"

In the meantime though I came up with a regex for finding that
attribute within the input; my Tcl/regex skills need some sharpening
anyway:

regexp {full-path=["'](.*?)["']}
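
[Editorial sketch] As posted, the pattern has no input string or match variables. A complete call might look like the following, assuming the file contents are in a variable named `input` and the result goes into a hypothetical variable `fullPath`. The sample XML is invented to resemble the file described above, not copied from a real EPUB.

```tcl
# Hypothetical container file content; only the full-path attribute matters here.
set input {<?xml version="1.0"?>
<container version="1.0">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>}

# "->" is just a throwaway variable name that receives the whole match;
# the first capture group lands in fullPath.
if {[regexp {full-path=["'](.*?)["']} $input -> fullPath]} {
    puts $fullPath
} else {
    puts "no full-path attribute found"
}
```

The non-greedy `.*?` stops at the first closing quote, so later attributes on the same line are not swallowed into the match.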

jemptymethod

Jan 5, 2011, 6:31:48 PM

Comparing the output of the above XML as a list to its original string
content, I wonder if it might not be more useful to convert xml to a
(nested) dictionary?

The following:

package {xmlns http://www.idpf.org/2007/opf xmlns:dc http://purl.org/dc/elements/1.1/
unique-identifier bookid version 2.0} {{metadata {} {{dc:title {} {{{#text} {Hello World: My First EPUB}}}}
{dc:creator {} {{{#text} {My Name}}}} {dc:identifier {id bookid} {{{#text} urn:uuid:12345}}}
{dc:language {} {{{#text} en-US}}} {meta {name cover content cover-image} {}}}}
{manifest {} {{item {id ncx href toc.ncx media-type text/xml} {}}
{item {id cover href title.html media-type application/xhtml+xml} {}}
{item {id content href content.html media-type application/xhtml+xml} {}}
{item {id cover-image href images/cover.png media-type image/png} {}}
{item {id css href stylesheet.css media-type text/css} {}}}}
{spine {toc ncx} {{itemref {idref cover linear no} {}} {itemref {idref content} {}}}}
{guide {} {{reference {href title.html type cover title Cover} {}}}}}

Is what results from:

<?xml version='1.0' encoding='utf-8'?>
<package xmlns="http://www.idpf.org/2007/opf"
xmlns:dc="http://purl.org/dc/elements/1.1/"
unique-identifier="bookid" version="2.0">
<metadata>
<dc:title>Hello World: My First EPUB</dc:title>
<dc:creator>My Name</dc:creator>
<dc:identifier id="bookid">urn:uuid:12345</dc:identifier>
<dc:language>en-US</dc:language>
<meta name="cover" content="cover-image" />
</metadata>
<manifest>
<item id="ncx" href="toc.ncx" media-type="text/xml"/>
<item id="cover" href="title.html" media-type="application/xhtml+xml"/>
<item id="content" href="content.html" media-type="application/xhtml+xml"/>
<item id="cover-image" href="images/cover.png" media-type="image/png"/>
<item id="css" href="stylesheet.css" media-type="text/css"/>
</manifest>
<spine toc="ncx">
<itemref idref="cover" linear="no"/>
<itemref idref="content"/>
</spine>
<guide>
<reference href="title.html" type="cover" title="Cover"/>
</guide>
</package>

jemptymethod

Jan 5, 2011, 9:39:10 PM
> package {xmlns http://www.idpf.org/2007/opf xmlns:dc http://purl.org/dc/elements/1.1/
> unique-identifier bookid version 2.0} {{metadata {} {{dc:title {} {{{#text} {Hello World: My First EPUB}}}}
> {dc:creator {} {{{#text} {My Name}}}} {dc:identifier {id bookid} {{{#text} urn:uuid:12345}}}
> {dc:language {} {{{#text} en-US}}} {meta {name cover content cover-image} {}}}}
> {manifest {} {{item {id ncx href toc.ncx media-type text/xml} {}}
> {item {id cover href title.html media-type application/xhtml+xml} {}}
> {item {id content href content.html media-type application/xhtml+xml} {}}
> {item {id cover-image href images/cover.png media-type image/png} {}}
> {item {id css href stylesheet.css media-type text/css} {}}}}
> {spine {toc ncx} {{itemref {idref cover linear no} {}} {itemref {idref content} {}}}}
> {guide {} {{reference {href title.html type cover title Cover} {}}}}}
>
> Is what results from:
>
> <?xml version='1.0' encoding='utf-8'?>
>
> SNIP

Or I guess I can use the nested list options to lsearch:
http://www.tcl.tk/man/tcl/TclCmd/lsearch.htm#M24
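
[Editorial sketch] One way to combine the two ideas (nested `lsearch` and dictionaries) is shown below. It assumes Tcl 8.5 or later and a tree in the {tag attributes children} shape shown above, abbreviated here to two manifest items. The variable names `tree`, `manifest`, and `hrefs` are made up for the example.

```tcl
# Parsed tree in the {tag attributes children} shape, shortened for brevity.
set tree {package {version 2.0} {
    {manifest {} {
        {item {id ncx href toc.ncx media-type text/xml} {}}
        {item {id content href content.html media-type application/xhtml+xml} {}}
    }}
    {spine {toc ncx} {
        {itemref {idref content} {}}
    }}
}}

lassign $tree tag attrs children

# -index 0 compares the pattern against the first element (the tag name)
# of each child; -inline returns the matching child instead of its position.
set manifest [lsearch -exact -inline -index 0 $children manifest]

# Attribute lists are flat key/value lists, so dict get can read them directly.
set hrefs {}
foreach item [lindex $manifest 2] {
    lassign $item itemTag itemAttrs
    dict set hrefs [dict get $itemAttrs id] [dict get $itemAttrs href]
}

puts [dict get $hrefs content]
```

Reading the attributes with `dict get` avoids depending on attribute order, which a positional `lsearch -index {1 1}` on the attribute list would.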

MSEdit

Jan 6, 2011, 11:17:06 AM

Have you tried TAX? I use it for several files, some as large as 10 MB,
with only a slight delay:

http://wiki.tcl.tk/14534

The string map modification speeds things up slightly and removes one
more line of code!

Martyn

keithv

Jan 6, 2011, 4:34:11 PM
On Jan 5, 4:43 pm, jemptymethod <jemptymet...@gmail.com> wrote:
> I'd like to find a simple XML parser in pure Tcl, for instance a
> single file I can drop into the lib directory of a starkit.  I've
> searched of course, but libraries I've found such as tcldom and some
> others seem to be dependent on native code, correct me if I'm wrong?
>
> My needs are very simple, I just want to parse a couple of xml files
> that can be found in the manifest for epub files.  I have actually
> tried a parsing routine I found at http://wiki.tcl.tk/3919 but it
> throws an error.
>
> I've edited that page to show the input I'm using; the xml is so
> simple I could probably use a regular expression, as all I need to do
> is extract the value of the rootfile tag's full-path attribute.  The
> file corresponding to full-path however promises to be a somewhat more
> substantial xml file (despite the .opf extension) which I will in turn
> need to parse.
>
> Thanks in advance.  If I botched the formatting a bit of my wiki edit
> please feel free to correct it, or me!

http://wiki.tcl.tk/11020

Cameron Laird

Jan 7, 2011, 10:23:30 AM
On Jan 5, 3:43 pm, jemptymethod <jemptymet...@gmail.com> wrote:
> I'd like to find a simple XML parser ...
...
[considerable detail about Tcl
and other matters]
...


I think we've identified the difficulty.

I'll be flagrantly obvious: "simple XML ..." is an oxymoron.

jemptymethod

Jan 8, 2011, 10:04:13 AM

Haha, but isn't there some software called "SAX": Simple API for XML?
Nevertheless, and please correct me if I'm wrong, XML handling seems to
be yet another area (besides OO) where there is quite a bit of
fragmentation within the Tcl community.

Alexandre Ferrieux

Jan 8, 2011, 10:20:59 AM

Yes. With any cool tool, random contraptions ranging from desperate
sh*t to jewels are doable. Turing completeness on one side, human
creativity on the other. So what ?

-Alex

Donal K. Fellows

Jan 8, 2011, 10:24:35 AM
On Jan 8, 3:04 pm, jemptymethod <jemptymet...@gmail.com> wrote:
> Haha, but isn't there some software called "SAX": Simple API for XML.
> Nevertheless, and please correct me if I'm wrong, but seems to be yet
> another area (besides OO), XML handling, where there seems to be quite
> a bit of fragmentation within the Tcl community.

The wise people among us use tDOM, provided we don't need streaming.
Yes, it's not a pure Tcl XML parser, but it *is* very very good. (The
wise people among us also don't worry too much about doing everything
in Tcl, not when the real treasures come from using C and Tcl
together.)

Donal.

jemptymethod

Jan 9, 2011, 9:15:53 AM
On Jan 8, 10:24 am, "Donal K. Fellows"

If my XML needs weren't so simplistic (I've already demonstrated above
that the first of the two steps I need to perform can be achieved with a
regex), and the application I'm trying to push forward didn't need to
run on multiple platforms, I too wouldn't care that tcldom isn't
written in pure Tcl.
