Spare w3.org with Resolving XML Reader

Stuart Sierra

Feb 27, 2008, 1:35:51 PM
to Clojure
Hi all,
I was reading this: http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
and wanted to get Clojure's xml.clj to use a catalog-resolving XML
parser so it doesn't download the XHTML DTDs on every job.

Start here: http://xml.apache.org/commons/components/resolver/
You need resolver.jar from the Apache XML Commons on your Java
CLASSPATH.
The hard part is getting your CatalogManager.properties and your
catalog files set up correctly -- that took me all of this morning,
and required writing my own catalog. You can use "sudo ngrep -q dtd"
to see if the parser is downloading the DTD every time.

The easy part is altering xml.clj:

Index: src/xml.clj
===================================================================
--- src/xml.clj (revision 698)
+++ src/xml.clj (working copy)
@@ -10,7 +10,7 @@
 (clojure/refer 'clojure)

 (import '(org.xml.sax ContentHandler Attributes SAXException)
-        '(javax.xml.parsers SAXParser SAXParserFactory))
+        '(org.apache.xml.resolver.tools ResolvingXMLReader))

 (def *stack*)
 (def *current*)
@@ -64,7 +64,9 @@
 nil)))))

 (defn startparse-sax [s ch]
-  (.. SAXParserFactory (newInstance) (newSAXParser) (parse s ch)))
+  (let [parser (new ResolvingXMLReader)]
+    (. parser (setContentHandler ch))
+    (. parser (parse s))))

 (defn parse
   ([s] (parse s startparse-sax))

Rich Hickey

Feb 27, 2008, 3:04:08 PM
to Clojure


On Feb 27, 1:35 pm, Stuart Sierra <the.stuart.sie...@gmail.com> wrote:
> Hi all,
> I was reading this: http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
> and wanted to get Clojure's xml.clj to use a catalog-resolving XML
> parser so it doesn't download the XHTML DTDs on every job.
>
> Start here: http://xml.apache.org/commons/components/resolver/
> You need resolver.jar from the Apache XML Commons on your Java
> CLASSPATH.
> The hard part is getting your CatalogManager.properties and your
> catalog files set up correctly -- that took me all of this morning,
> and required writing my own catalog. You can use "sudo ngrep -q dtd"
> to see if the parser is downloading the DTD every time.
>
> The easy part is altering xml.clj:
>

It's even easier than that. The idea behind the startparse arg is that
you could do something like this:

(defn startparse-apache [s ch]
  (doto (new ResolvingXMLReader) (setContentHandler ch) (parse s)))

(xml/parse s startparse-apache)

without touching xml.clj

(as originally suggested here: http://groups.google.com/group/clojure/msg/931ddaf972e69e6d)

Rich

John Cowan

Feb 27, 2008, 3:08:07 PM
to clo...@googlegroups.com
On Wed, Feb 27, 2008 at 3:04 PM, Rich Hickey <richh...@gmail.com> wrote:

> It's even easier than that. The idea behind the startparse arg is that
> you could do something like this:
>
> (defn startparse-apache [s ch]
> (doto (new ResolvingXMLReader) (setContentHandler ch) (parse s)))
>
> (xml/parse s startparse-apache)
>
> without touching xml.clj

By default, though, xml.clj should make use of a local catalog and
cache in order to be a good Internet citizen.

--
GMail doesn't have rotating .sigs, but you can see mine at
http://www.ccil.org/~cowan/signatures

Rich Hickey

Feb 27, 2008, 3:25:54 PM
to Clojure


On Feb 27, 3:08 pm, "John Cowan" <johnwco...@gmail.com> wrote:

> By default, though, xml.clj should make use of a local catalog and
> cache in order to be a good Internet citizen.
>

I'd be happy to do so - is there a way to do that with the stock JDK
1.5?

Rich

Stuart Sierra

Feb 27, 2008, 4:35:17 PM
to Clojure
On Feb 27, 3:08 pm, "John Cowan" <johnwco...@gmail.com> wrote:
> By default, though, xml.clj should make use of a local catalog and
> cache in order to be a good Internet citizen.

On Feb 27, 3:25 pm, Rich Hickey <richhic...@gmail.com> wrote:
> I'd be happy to do so - is there a way to do that with the stock JDK
> 1.5?

You could implement a custom org.xml.sax.EntityResolver, but then you
would have to implement or copy all the catalog resolvers -- the JDK
doesn't provide anything.

I suppose you could implement an EntityResolver that "knows" the major
DTDs, but you would still have to include copies of those DTDs in the
source.
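
As a rough illustration of that second idea, here is a minimal sketch
that uses only the stock JDK SAX classes and the startparse hook
mentioned earlier in the thread. It is written in present-day Clojure
syntax (reify, dotted method calls), not the 2008 syntax used above. The
names known-dtds, local-resolver and startparse-local-dtds, the example
public ID, and the one-line DTD are all hypothetical; a real table would
hold copies of the XHTML DTDs.

```clojure
(require '[clojure.xml :as xml])
(import '(java.io StringReader)
        '(javax.xml.parsers SAXParserFactory)
        '(org.xml.sax EntityResolver InputSource))

;; Hypothetical table of "known" DTDs, keyed by public ID. The single
;; entry here is a made-up DTD so that the example is self-contained.
(def known-dtds
  {"-//EXAMPLE//DTD Greeting 1.0//EN" "<!ELEMENT greeting (#PCDATA)>"})

(def local-resolver
  (reify EntityResolver
    (resolveEntity [_ public-id system-id]
      ;; Returning nil tells the parser to fall back to its default
      ;; behavior, which is to open the system ID itself.
      (when-let [dtd (get known-dtds public-id)]
        (doto (InputSource. (StringReader. dtd))
          (.setPublicId public-id)
          (.setSystemId system-id))))))

(defn startparse-local-dtds
  "A startparse function for xml/parse that installs local-resolver.
  s must be an InputSource or a URI string, because XMLReader.parse
  accepts only those two argument types."
  [s ch]
  (doto (.. (SAXParserFactory/newInstance) newSAXParser getXMLReader)
    (.setContentHandler ch)
    (.setEntityResolver local-resolver)
    (.parse s)))

;; The system ID points at a host that does not exist, so the parse can
;; only succeed if the DTD is served from known-dtds.
(def doc
  (str "<!DOCTYPE greeting PUBLIC \"-//EXAMPLE//DTD Greeting 1.0//EN\" "
       "\"http://invalid.invalid/greeting.dtd\">"
       "<greeting>hi</greeting>"))

(xml/parse (InputSource. (StringReader. doc)) startparse-local-dtds)
```

Any DOCTYPE whose public ID is not in the table still goes to the
network, so this only helps for the DTDs you have chosen to bundle.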

-Stuart

John Cowan

Feb 27, 2008, 9:59:38 PM
to clo...@googlegroups.com
On Wed, Feb 27, 2008 at 4:35 PM, Stuart Sierra
<the.stua...@gmail.com> wrote:

> I suppose you could implement an EntityResolver that "knows" the major
> DTDs, but you would still have to include copies of those DTDs in the
> source.

Not necessarily. You could just cache them as they are downloaded.
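
A minimal sketch of what such a cache might look like, again with only
stock JDK classes and in present-day Clojure syntax. The functions
cache-file and caching-resolver, and the choice of file names (the
URL-encoded system ID), are assumptions made for illustration. There is
no expiry, and a download that fails halfway would leave a truncated
file behind, so treat it as a starting point only.

```clojure
(require '[clojure.java.io :as io])
(import '(java.net URL URLEncoder)
        '(org.xml.sax EntityResolver InputSource))

(defn cache-file
  "Maps a system ID to a file under cache-dir. The URL-encoded system ID
  is used as the file name (a hypothetical naming scheme)."
  [cache-dir system-id]
  (io/file cache-dir (URLEncoder/encode system-id "UTF-8")))

(defn caching-resolver
  "Returns an EntityResolver that downloads each http(s) entity once and
  serves it from cache-dir afterwards."
  [cache-dir]
  (reify EntityResolver
    (resolveEntity [_ public-id system-id]
      ;; Only http(s) system IDs are cached. For anything else, nil lets
      ;; the parser use its default resolution.
      (when (and system-id (re-find #"^https?://" system-id))
        (let [f (cache-file cache-dir system-id)]
          (when-not (.exists f)
            ;; First request for this entity: fetch it and store a copy.
            (io/make-parents f)
            (with-open [in (io/input-stream (URL. system-id))]
              (io/copy in f)))
          (doto (InputSource. (io/input-stream f))
            (.setPublicId public-id)
            (.setSystemId system-id)))))))
```

A resolver built this way could be installed on a parser with
setEntityResolver from inside a custom startparse function, for example
(caching-resolver (io/file "dtd-cache")), where the directory name is
again just an example.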

Rich Hickey

Feb 28, 2008, 8:31:42 AM
to Clojure


On Feb 27, 9:59 pm, "John Cowan" <johnwco...@gmail.com> wrote:
> On Wed, Feb 27, 2008 at 4:35 PM, Stuart Sierra
>
> <the.stuart.sie...@gmail.com> wrote:
> > I suppose you could implement an EntityResolver that "knows" the major
> > DTDs, but you would still have to include copies of those DTDs in the
> > source.
>
> Not necessarily. You could just cache them as they are downloaded.
>

I'm sorry, these XML APIs are just not my area of expertise. If there
is something specific and straightforward I can/should do with the
stock JDK 1.5, please let me know what it is.

Rich

Stuart Sierra

Feb 28, 2008, 9:31:20 AM
to Clojure
On Feb 28, 8:31 am, Rich Hickey <richhic...@gmail.com> wrote:
> I'm sorry, these XML APIs are just not my area of expertise. If there
> is something specific and straightforward I can/should do with the
> stock JDK 1.5, please let me know what it is.

Unfortunately, I don't think there is -- this is a problem with the
Java XML libraries, and "they" should fix it.
-Stuart