xml.clj + tagsoup


Chouser

Feb 22, 2008, 2:41:03 PM
to Clojure
I seem to be more or less constantly writing HTML screen-scrapers, but
I have yet to find a really nice way to do it. Maybe Clojure will be
my salvation!

With that as my goal, I tried to integrate TagSoup with Clojure's
xml.clj, and it seems to work quite nicely.

Just replace xml.clj's parse function with:

(defn startparse-sax [s ch]
  (.. SAXParserFactory (newInstance) (newSAXParser) (parse s ch)))

(defn parse
  ([s] (parse s startparse-sax))
  ([s startparse]
    (binding [*stack* nil
              *current* (struct element)
              *state* :between
              *sb* nil]
      (startparse s content-handler)
      ((:content *current*) 0))))

Now (xml/parse "foo.xml") works as it did before, but you can plug in
other parsers if you want. For TagSoup:

(defn startparse-tagsoup [s ch]
  (let [p (new org.ccil.cowan.tagsoup.Parser)]
    (. p (setContentHandler ch))
    (. p (parse s))))

(xml/parse "foo.html" startparse-tagsoup)
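To make the plug-in point concrete without needing TagSoup on the classpath, here is a minimal sketch that assumes a current Clojure, where clojure.xml/parse accepts the same optional startparse argument. The names startparse-ns-aware and string->stream are made up for this example; the only contract is that a startparse function takes the source and the content handler and drives the parse.

```clojure
(require '[clojure.xml :as xml])
(import '(java.io ByteArrayInputStream)
        '(javax.xml.parsers SAXParserFactory)
        '(org.xml.sax.helpers DefaultHandler))

;; Hypothetical startparse: same two-argument shape as startparse-sax,
;; but it asks the JDK factory for a namespace-aware SAX parser.
(defn startparse-ns-aware [^java.io.InputStream s ^DefaultHandler ch]
  (let [factory (doto (SAXParserFactory/newInstance)
                  (.setNamespaceAware true))]
    (.parse (.newSAXParser factory) s ch)))

;; Hypothetical helper so the example needs no file on disk.
(defn string->stream [^String text]
  (ByteArrayInputStream. (.getBytes text "UTF-8")))

(def doc "<html><body><p>hello</p></body></html>")

;; One-argument call: the default parser.
(def default-tree (xml/parse (string->stream doc)))

;; Two-argument call: the parser plugged in above.
(def custom-tree (xml/parse (string->stream doc) startparse-ns-aware))
```

For well-formed input like this, both calls should build the same tree; the choice of parser only starts to matter when the input is something a strict XML parser would reject, which is where TagSoup comes in.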

And you're off and running. Now all we need is a nice query language
for the vector/map tree that gives you...
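
As a rough illustration of what querying that tree can look like, here is a small sketch. It assumes a current Clojure, where xml-seq in clojure.core walks the parsed tree depth-first; elements-by-tag is a hypothetical helper, not part of xml.clj, and this is nowhere near a real query language.

```clojure
(require '[clojure.xml :as xml])
(import '(java.io ByteArrayInputStream))

;; Parse a small document from a string so the example is self-contained.
(def page
  (xml/parse
    (ByteArrayInputStream.
      (.getBytes
        "<html><body><a href=\"/one\">1</a><p><a href=\"/two\">2</a></p></body></html>"
        "UTF-8"))))

;; Hypothetical helper: every element in the tree with the given tag.
;; Text nodes are plain strings, so (:tag node) is nil for them.
(defn elements-by-tag [tree tag]
  (filter #(= tag (:tag %)) (xml-seq tree)))

;; The href of every link, in document order.
(def hrefs
  (map #(get-in % [:attrs :href]) (elements-by-tag page :a)))
```

Each element is a map with :tag, :attrs and :content keys, and :content is a vector of child elements and strings, so ordinary sequence functions already go a long way.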

--Chouser

John Cowan

Feb 22, 2008, 3:51:31 PM
to clo...@googlegroups.com
On Fri, Feb 22, 2008 at 2:41 PM, Chouser <cho...@gmail.com> wrote:

> (defn startparse-tagsoup [s ch]
>   (let [p (new org.ccil.cowan.tagsoup.Parser)]
>     (. p (setContentHandler ch))
>     (. p (parse s))))
>
> (xml/parse "foo.html" startparse-tagsoup)
>
> And you're off and running.

Excellent! Now you may thank Rich and me in whatever way occurs to you. :-)

--
GMail doesn't have rotating .sigs, but you can see mine at
http://www.ccil.org/~cowan/signatures
