List-Oriented XML Parser

Steve Harris

Feb 23, 2008, 12:09:21 PM
to Clojure
Here's an SXML-style parser based on Rich's original xml.clj, in case
anyone is interested.
It's not meant to compete with xml.clj; its main purpose is to help
with a port of sxpath/sxslt, if I ever get around to it. The output is
just like SXML except that the attributes are Clojure maps. It handles
mixed content (I see Rich has added that too in SVN, great!) and
attempts to do the right thing by optionally ignoring whitespace
between elements.

Produced content looks like this:
(*top*
 (account {:title "Savings 1"}
  (ownerid "12398")
  (balance {:currency "USD"} "3212.12")
  (descr-html "Main " (b "short term savings") " account.")))


PS: I'm sure the source will be garbled by randomly placed line
breaks. Is there a way to upload files here? I was able to see the
"Files" page but I don't see an upload link...


; Copyright (c) Rich Hickey. All rights reserved.
; The use and distribution terms for this software are covered by the
; Common Public License 1.0 (http://opensource.org/licenses/cpl.php)
; which can be found in the file CPL.TXT at the root of this distribution.
; By using this software in any fashion, you are agreeing to be bound by
; the terms of this license.
; You must not remove this notice, or any other, from this software.

(in-ns 'xml)
(clojure/refer 'clojure)

(import '(org.xml.sax ContentHandler Attributes SAXException)
        '(javax.xml.parsers SAXParser SAXParserFactory)
        '(org.xml.sax InputSource))

(def *stack*)
(def *current*)
(def *pending-chars*)
(def *state*)

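;; Elements are accumulated in reverse (conj onto a list prepends),
;; so reverse once when the element is complete.
(defn finalize-element [e] (reverse e))


;; Flush the buffered character data into the current element as a string.
(defn add-pending-char-data []
  (set! *current* (conj *current* (str *pending-chars*)))
  (set! *pending-chars* nil))


;; True if the len chars of chars-array starting at start are all whitespace.
(defn all-whitespace? [chars-array start len]
  (loop [i (+ start (dec len))]
    (if (< i start)
      true
      (if (not (. Character (isWhitespace (aget chars-array i))))
        false
        (recur (dec i))))))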


(defn content-handler [opts]
  (new clojure.lang.XMLHandler
    (implement [ContentHandler]

      (startElement [uri local-name q-name #^Attributes atts]
        (let [make-attrs (fn [ret i]
                           (if (neg? i)
                             ret
                             (recur (assoc ret
                                           (. clojure.lang.Keyword
                                              (intern (symbol (. atts (getQName i)))))
                                           (. atts (getValue i)))
                                    (dec i))))

              attrs (make-attrs {} (dec (. atts (getLength))))

              ;; Built in reverse; finalize-element puts the tag first.
              new-el (if (. attrs (isEmpty))
                       (list (symbol q-name))
                       (list attrs (symbol q-name)))]

          ;; Whitespace-only text seen just before a start tag is dropped
          ;; when the option is set; any other text is kept.
          (when *pending-chars*
            (let [ignore (and (:ignore-whitespace-between-elements opts)
                              (or (= *state* :ws-read-after-element-start)
                                  (= *state* :ws-read-after-element-end)))]
              (if ignore
                (set! *pending-chars* nil)
                (add-pending-char-data))))
          (set! *stack* (conj *stack* *current*))
          (set! *current* new-el)
          (set! *state* :element-started))
        nil)


      (endElement [uri local-name q-name]
        ;; Only whitespace that follows a closing tag is dropped here, so
        ;; whitespace-only element content such as <a> </a> is preserved.
        (when *pending-chars*
          (let [ignore (and (:ignore-whitespace-between-elements opts)
                            (= *state* :ws-read-after-element-end))]
            (if ignore
              (set! *pending-chars* nil)
              (add-pending-char-data))))
        (set! *current* (conj (peek *stack*) (finalize-element *current*)))
        (set! *stack* (pop *stack*))
        (set! *state* :element-ended)
        nil)


      (characters [cdata start len]
        (when-not *pending-chars*
          (set! *pending-chars* (new StringBuilder)))
        (let [#^StringBuilder sb *pending-chars*]
          (. sb (append cdata start len))
          (set! *state*
                (if (and (:ignore-whitespace-between-elements opts)
                         (all-whitespace? cdata start len))
                  (cond
                    (or (= *state* :element-started)
                        (= *state* :ws-read-after-element-start))
                    :ws-read-after-element-start
                    (or (= *state* :element-ended)
                        (= *state* :ws-read-after-element-end))
                    :ws-read-after-element-end
                    true :chars-read)
                  :chars-read)))
        nil))))

;; TODO:
;; Add option: :validating (in which case tell parser to ignore
;; ignorable whitespace).
;; Make parser namespace aware (test - what's the difference?)

(defn parse
  ([s] (parse s {:ignore-whitespace-between-elements true}))
  ([s opts]
    (let [p (.. SAXParserFactory (newInstance) (newSAXParser))]
      (binding [*stack* nil
                *current* '(*top*)
                *state* nil
                *pending-chars* nil]
        (. p (parse (new InputSource s) (content-handler opts)))
        (finalize-element *current*)))))


(import '(java.io StringReader))
(defn test1 []
  (let [cxml '(*top*
               (account {:title "Savings 1"}
                (ownerid "12398")
                (balance {:currency "USD"} "3212.12")
                (descr-html "Main " (b "short term savings") " account.")))
        xml (str "<account title='Savings 1'>"
                 "<ownerid>12398</ownerid>"
                 "<balance currency=\"USD\">3212.12</balance>"
                 "<descr-html>Main <b>short term savings</b> account.</descr-html>"
                 "</account>")]
    (assert (= cxml (parse (new StringReader xml))))
    (println "Test succeeded.")))

Steve Harris

Feb 23, 2008, 1:15:46 PM
to Clojure
I put it in the examples section of the Wiki where it's more readable:

http://en.wikibooks.org/wiki/Clojure_Programming#Examples


Rich Hickey

Feb 23, 2008, 2:25:51 PM
to Clojure


On Feb 23, 12:09 pm, Steve Harris <steveO...@gmail.com> wrote:
> Here's an sxml style parser based on Rich's original xml.clj, in case
> anyone is interested.

> PS: I'm sure the source will be garbled by randomly placed line
> breaks. Is there a way to upload files here? I was able to see the
> "Files" page but I don't see an upload link...
>

Thanks Steve - I've enabled file uploads for members if you want to
put it up intact. It will have to come off of the Wiki though, as it
is derived from the Clojure CPL source, and the Wiki is GNU Free
Documentation License.

On a technical note, I know you are following SXML, but did you
consider using vectors instead of lists?

[*top*
 [account {:title "Savings 1"}
  [ownerid "12398"]
  [balance {:currency "USD"} "3212.12"]
  [descr-html "Main " [b "short term savings"] " account."]]]

I presume code that uses this structure will have to examine the type
of the second element for map? in order to determine if there are
attributes?
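
To make that concrete, here is a minimal sketch of what consumer code might look like when the attribute map is optional. The helper names `element-attrs` and `element-children` are hypothetical (they are not part of the posted parser), and the sketch assumes current Clojure rather than the 2008 pre-release syntax used in the thread.

```clojure
;; Hypothetical helpers, not part of the posted parser.
;; An element has the shape (tag attrs? child*), where attrs,
;; when present, is a map in second position.

(defn element-attrs
  "Returns the attribute map of el, or {} when the element has none."
  [el]
  (let [maybe-attrs (second el)]
    (if (map? maybe-attrs) maybe-attrs {})))

(defn element-children
  "Returns the children of el, skipping the attribute map if present."
  [el]
  (if (map? (second el))
    (drop 2 el)
    (rest el)))
```

Every accessor has to repeat the `map?` check, which is the cost of leaving empty attribute maps out.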

The reason I ask is that I had/have a version of the XML parser that
emits:

[*top* {}
 [account {:title "Savings 1"}
  [ownerid {} "12398"]
  [balance {:currency "USD"} "3212.12"]
  [descr-html {} "Main " [b {} "short term savings"] " account."]]]

which I go back and forth on using instead of my current
representation. I think it is much more read/writable for humans, and,
now with subvec, access to the contents can be made very fast. It
might also be a bit easier to use with a zipper I've been working on.
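
As a rough illustration of the access pattern this enables: when every element is a vector with the attribute map always in position 1, the accessors need no type checks. The names `el-tag`, `el-attrs`, and `el-content` below are hypothetical, and the sketch assumes current Clojure.

```clojure
;; Hypothetical accessors for the [tag attrs child*] vector shape.
;; The attribute map is always at index 1, so nothing needs a map? test.

(defn el-tag [el] (nth el 0))

(defn el-attrs [el] (nth el 1))

(defn el-content
  "Returns the children as a vector that shares structure with el."
  [el]
  (subvec el 2))
```

For example, `(el-content '[balance {:currency "USD"} "3212.12"])` yields a one-element vector holding the text node, and an element with no children yields an empty vector.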

Anyone have any thoughts?

Rich

Steve Harris

Feb 23, 2008, 4:00:41 PM
to Clojure
I think I like the idea of using vectors instead because, besides the
efficiency issues, the brackets are a little easier to peck on the
keyboard. Strict compatibility with SXML isn't too important to me
(which is why I tossed out the alists in favor of maps for the
attributes). I think most of the porting difficulty will be with the
macros anyway, at least in my case.

You're right, I was thinking about having to examine the second item at
some point to look for attributes. I just assumed I'd always forget to
add them in literals when they're empty. That's not really an issue
for the XML parser's output though, so I guess you're right that it should
just always supply them in its output. SXML seems to have different
levels of "normalization" because of issues like this
(http://okmij.org/ftp/Scheme/SXML.html#Normalized%20SXML), with most
functions acting on the normalized SXML. I assume SXML from outside gets
funneled through a "normalizer" before processing (which is a pretty
simple single-pass or even lazy thing, I think), but I haven't looked
to verify.
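
A single-pass normalizer of that kind might look roughly like the sketch below. It is not taken from any SXML implementation: the name `normalize` is hypothetical, it assumes current Clojure, and it assumes the only maps in the tree are attribute maps in second position.

```clojure
;; Hypothetical normalizer: rewrites (tag attrs? child*) lists into
;; [tag attrs child*] vectors, supplying {} where attributes were omitted.

(defn normalize
  [node]
  (if (sequential? node)
    (let [[tag & more] node
          has-attrs?   (map? (first more))
          attrs        (if has-attrs? (first more) {})
          children     (if has-attrs? (rest more) more)]
      ;; Recurse into child elements; text nodes pass through unchanged.
      (into [tag attrs] (map normalize children)))
    node))
```

With this in place, hand-written literals can leave out empty attribute maps and downstream code can still rely on the uniform shape.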


PS: I'm removing the code from the wiki (sorry, I'm oblivious
sometimes). I tried uploading to the files area but Google
unhelpfully tells me "failed" and that's all (Google's kind of
"minimal" sometimes, isn't it?). The problem could be on my end, so I'll
try again tomorrow or Mon.

Thanks for the comments.

Cheers
Steve




Rich Hickey

Feb 25, 2008, 10:11:19 PM
to Clojure


On Feb 23, 4:00 pm, Steve Harris <steveO...@gmail.com> wrote:

> I tried uploading to the files area but Google
> unhelpfully tells me "failed" and that's all (Google's kind of
> "minimal" sometimes, isn't it?). The problem could be on my end, so I'll
> try again tomorrow or Mon.
>

Would you mind trying again?

Thanks,

Rich

Steve Harris

Feb 26, 2008, 1:37:53 PM
to Clojure
Still no worky, same simple "failed" message.

Tried on OS X yesterday and today on Linux and Windows, using Firefox
in all cases. Also tried IE 7 just now but didn't see the upload
button at all, though that could just reflect our paranoid setup here
at work.

PS: no hurry on my end, and feel free to reply off list if you want me
to try again...


Steve