Parsing namespaced XML with clojure.data.xml

Matching Socks

Aug 20, 2016, 3:43:04 PM
to Clojure
The future is XML-with-namespaces: POM files and whatnot.  Such cases are tricky because more than one notation is possible.  You need a namespace-enabled parser to figure out what the XML text really means.  Luckily, a contributed project, clojure.data.xml, can read XML-with-namespaces, and in good idiom return Clojure-namespaced keywords for the element names.  (Its present version is 0.1.0-beta1, a work-in-progress.)  You configure the namespaces to keywordize as its README illustrates:

(declare-ns "xml.html" "http://www.w3.org/1999/xhtml")
(parse-str "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
            <foo:html xmlns:foo=\"http://www.w3.org/1999/xhtml\">
...

Could the same effect be obtained without the global state of namespace mappings?  Do all uses of clojure.data.xml in an app, even fully encapsulated uses, have to agree about the keyword for any given well-known XML namespace URI?

Herwig Hochleitner

Aug 21, 2016, 3:39:11 AM
to clo...@googlegroups.com
2016-08-20 21:43 GMT+02:00 Matching Socks <phill...@gmail.com>:

Could the same effect be obtained without the global state of namespace mappings?  Do all uses of clojure.data.xml in an app, even fully encapsulated uses, have to agree about the keyword for any given well-known XML namespace URI?

Currently, that is the case. The motivation is to ensure value-equality for parse trees within an application.
Do you have a compelling use case for passing the namespace mapping into the parse call? 

Matching Socks

Aug 21, 2016, 12:47:16 PM
to Clojure
Apps are cobbled together from sub-systems and libraries.  Some of those may use clojure.data.xml, either to share their products with their client or for their internal purposes.  As soon as two libraries on Clojars differ in their namespace-URI to keyword-namespace mapping, has the ship sunk? 

Nonetheless, value equality might sometimes be useful.  How to achieve it? 

There is already a globally distinct, agreed, and unambiguous way to refer to each well-known XML namespace URI.  It is the URI itself.  If the keyword-namespace must have the same properties, it ought to follow from the URI, not be left to the discretion of individual consumers. 

Could there be a well-known translation from namespace-URI to keyword-namespace and back?  These keyword namespaces would be cumbersome (as they must include the whole URI and also avoid colliding with the namespace of any other namespaced keywords anywhere), but consumers could alias them conveniently without impacting value comparisons:

(->> 'xmlns.http.www.w3.org.n1999.xhtml
     create-ns
     ns-name
     symbol
     (alias 'xhtml))
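
As a small sketch of why aliasing leaves value comparison intact: two consumers can choose different aliases for the same keyword namespace, and both still read to the same fully qualified keyword. The namespace name is the illustrative one from above, and the second alias `h` is made up for this example. It assumes the forms are evaluated one at a time, as at a REPL or when a file is loaded, because `::alias/name` is resolved at read time.

```clojure
;; Create the (empty) namespace that stands for the XML namespace URI.
(create-ns 'xmlns.http.www.w3.org.n1999.xhtml)

;; Two consumers pick different local aliases for it.
;; `h` is a hypothetical second alias, used only for illustration.
(alias 'xhtml 'xmlns.http.www.w3.org.n1999.xhtml)
(alias 'h     'xmlns.http.www.w3.org.n1999.xhtml)

;; Both aliased literals expand to the same fully qualified keyword,
;; so data built with either alias compares equal.
(= ::xhtml/p ::h/p :xmlns.http.www.w3.org.n1999.xhtml/p)
```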

To account for the whole space of URIs, without violating the Clojure or EDN keyword namespace spec or compromising reverse translation back to the URI, you might have to go further.  For example, combine a legible symbol name computed with some loss (as an assertion) and a Base-64 encoding...

(->> 'xmlns.http.maven.apache.org.POM.4.0.0.aHR0cDovL21hdmVuLmFwYWNoZS5vcmcvUE9NLzQuMC4wCg==
     create-ns
     ns-name
     symbol
     (alias 'pom))
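
A minimal sketch of how such a name could be computed and reversed, under some assumptions: the function names `uri->legible-b64` and `legible-b64->uri` are hypothetical, the legible part is produced by collapsing every run of non-alphanumeric characters into a dot, and the URL-safe Base64 alphabet is used so that the encoded part can never contain `/` or `.`. Only the Base64 part is used for the reverse translation; the legible part is there for the human reader.

```clojure
(require '[clojure.string :as str])
(import '(java.util Base64))

;; Hypothetical helper: legible (lossy) prefix plus exact Base64 suffix.
(defn uri->legible-b64 [^String uri]
  (let [legible (-> uri
                    (str/replace #"[^A-Za-z0-9]+" ".")   ; lossy, for readability only
                    (str/replace #"^\.+|\.+$" ""))       ; trim leading/trailing dots
        encoded (.encodeToString (Base64/getUrlEncoder)
                                 (.getBytes uri "UTF-8"))]
    (str "xmlns." legible "." encoded)))

;; Hypothetical helper: recover the URI from the part after the last dot.
(defn legible-b64->uri [^String kwns]
  (let [encoded (subs kwns (inc (str/last-index-of kwns ".")))]
    (String. ^bytes (.decode (Base64/getUrlDecoder) ^String encoded) "UTF-8")))
```

For example, `(legible-b64->uri (uri->legible-b64 "http://maven.apache.org/POM/4.0.0"))` gives back the original URI.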

A well-known formula for namespace keywords representing XML namespaces could replace the ad-hoc mutable map and satisfy your dual aims that clojure.data.xml applications might use keywords for convenience while also maintaining strict value equality of the XML data structures all the way to the horizon.  (The data structure would use such keywords for all element tags.)

Herwig Hochleitner

Aug 22, 2016, 5:47:31 PM
to clo...@googlegroups.com
I've been thinking this over. I'm starting to feel that you are right that the arbitrary, global mapping could cause more problems than it would solve. Even if we could get by with a maintained registry, it would still be a burden to maintain and to use. Also, there is the open question of what happens to code expecting qnames when somebody suddenly declares a new xmlns mapping.

There is the possibility to canonicalize by cramming the xmlns uri into a readable kw-ns, and that would still neatly reuse Clojure's ns-alias facility. What I don't like about the approach is that it would make even pretty-printed XML parse trees quite unreadable. While :xmlns.dav/multistatus vs :xmlns.REFWOgo=/multistatus might not look as horrifying, consider :xmlns.aHR0cDovL3d3dy53My5vcmcvMTk5OS94aHRtbAo=/p for an xhtml paragraph.

Maybe it's time to give up on universal value equality of parsed XML and instead make the keyword mapping à la carte, with a parser / emitter flag.
Technically, universal value equality is already challenged by the qname / keyword dichotomy. Given that we want to keep using ::alias/keywords, there is a decision to be made: either make qname the canonical representation and embrace the multitude of keyword mappings, or eliminate qnames and take the readability hit for canonicalizing the keyword representation. Do you see any alternative?

Matching Socks

Aug 24, 2016, 7:53:06 PM
to Clojure
Namespaced XML is inherently value-comparable and unambiguous.  It would be a shame to give up on that, and disperse the burden throughout every layer of library and consumer.

Pretty-printing need not be a concern of the XML parsing library.  Everyone seems to be interested nowadays in easing the usage of namespaced keywords.  Perhaps printing could be improved (globally) to use the caller's keyword namespace aliases. 

Anyway, pretty-printing is always expensive.  If a keyword-conversion step must encumber either pretty-printing or everything else, better do it in pretty-printing.

Keyword *literals* make the source code easy to read, but composing keywords programmatically with a caller-provided namespace might be intolerable.  Moreover, providing those namespace mappings would be a messy headache for the consumer of XML processing libraries.  The mappings would have to pass through layer after layer.  No doubt, every library will provide different defaults.  One false step, and you would lose value comparability.

By contrast, with well-known keyword namespaces, each computed by a well-known function from its well-known namespace URI, everyone could write source code using keyword literals with whatever keyword namespace alias they want, and XML structures would be value-comparable.  In the short run, the best pretty-print might be actual XML serialization.  In the long run, I predict, Clojure's namespaced keywords will go down as smooth as fudge.

By all means, use an encoding more legible than Base64.  URLEncoder could be an example in the way it uses %.  Pick an escape character that's legal in Clojure namespace names, but unusual in the best-known namespace URIs.  Apostrophe?

Herwig Hochleitner

Aug 28, 2016, 5:54:42 AM
to clo...@googlegroups.com
2016-08-25 1:53 GMT+02:00 Matching Socks <phill...@gmail.com>:
Namespaced XML is inherently value-comparable and unambiguous.  It would be a shame to give up on that, and disperse the burden throughout every layer of library and consumer.

That's a very good point. Disregarding concerns for edn syntax for a moment, the best solution would seem to be to standardize on qnames, since those are the JVM-wide canonical mapping. With reader tags, they are almost there in terms of read/writability, but not quite as convenient as ::alias/keywords.

When designing the namespacing support, I discarded the idea of cramming URIs into keywords because of the impedance mismatch. I thank you for bringing it up again, though, because I overlooked a very important failure case when trying to preserve value semantics:

Say a library chooses not to use keyword mapping for a given xmlns and instead matches on qname instances, but then somebody within the system establishes an alias for that xmlns. Said library will then silently get keywordized data that it won't recognize anymore. Unfortunately, I don't know how to catch that with an explicit error either.

So yes, I am open to experimenting with encoding uris into keywords.

Pretty-printing need not be a concern of the XML parsing library.  Everyone seems to be interested nowadays in easing the usage of namespaced keywords.  Perhaps printing could be improved (globally) to use the caller's keyword namespace aliases.

Yes, it doesn't feel right to let an easily adaptable concern like pretty-printing dictate value semantics.

By all means, use an encoding more legible than Base64.  URLEncoder could be an example in the way it uses %.  Pick an escape character that's legal in Clojure namespace names, but unusual in the best-known namespace URIs.  Apostrophe?

Yes, or maybe map to unicode lookalikes and escape those, should they ever occur in a ns-uri.
e.g.

/ -> 
: -> 

Though, those are not alphanumeric characters, so probably illegal in keywords.

Any thoughts so far?

Matching Socks

Sep 17, 2016, 9:10:15 AM
to Clojure
To make a URI into a Clojure keyword namespace, we may simply replace
the 11 URI characters that are forbidden or problematic in keywords
with Unicode-alphabetic characters outside Latin-1.

The substitutes should be present in common desktop fonts, and should
not be mistaken for Latin-1 characters.  They should come from a
single Unicode script, to avoid burdensome Unicode puns.  It should
be a raster script that does not require decades of handwriting practice.

Cyrillic fits the bill very well:  it's recognizable and out-of-band.  You'd
never type these URI keywords in, but Cyrillic is a software-selectable
keyboard so you could if you felt like it.

  http://www.cs.yale.edu/~perlis-alan/quotes.html
  httpцЛЛwwwЯcsЯyaleЯeduЛжperlis-alanЛquotesЯhtml

Here is a demonstration of a simple URI <-> keyword translator
and a keyword-namespace aliasing macro to facilitate relatively
painless use of namespace literals in source code.

(To furthermore overcome the problem that "%" hex expressions compare
case-blind in URIs, but not keywords, we should norm %xx to %XX as RFC
3986 recommends before converting to a keyword namespace.)

(def problems  [\. \~ \: \/ \[ \] \@ \( \) \, \;])
(def solutions [\Я \д])

(defn- tr [a b]
  (let [m (zipmap a b)]
    (fn [s]
      (apply str (map #(m % %) s)))))

(def uri->kwns
  (tr problems solutions))

(def kwns->uri
  (tr solutions problems))

(defmacro alias-xml-ns [sym uri]
  `(let [kwns# (symbol (uri->kwns ~uri))]
     (create-ns kwns#)
     (alias ~sym kwns#)))

(comment

  (uri->kwns "http://www.w3.org/2000/01/rdf-schema#")
 
  (kwns->uri *1)

  (alias-xml-ns 'html "http://www.w3.org/1999/xhtml")

  (assert
   (identical? ::html/aside
               :httpцЛЛwwwЯw3ЯorgЛ1999Лxhtml/aside))

)
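
The %xx normalization mentioned above could be sketched as follows. The name `normalize-pct` is hypothetical, and the sketch assumes that only the hex digits of percent-escapes need upper-casing; it does not attempt any other RFC 3986 normalization.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: upper-case the hex digits of every %xx escape,
;; leaving the rest of the URI untouched.
(defn normalize-pct [uri]
  (str/replace uri #"%[0-9a-fA-F]{2}" str/upper-case))

(normalize-pct "http://example.com/a%2fb%3a")
```

Applying such a step before the URI is turned into a keyword namespace would make URIs that differ only in the case of their escapes land on the same keyword.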


Herwig Hochleitner

Sep 17, 2016, 10:15:50 AM
to clo...@googlegroups.com
2016-09-17 15:10 GMT+02:00 Matching Socks <phill...@gmail.com>:
To make a URI into a Clojure keyword namespace, we may simply replace
the 11 URI characters that are forbidden or problematic in keywords
with Unicode-alphabetic characters outside Latin-1.

Yep, I've been thinking along those lines as well. We'd still need an escape character, since unicode uris are a thing, but at least we could substitute :,/,... without it.

The substitutes should be present in common desktop fonts, and should
not be mistaken for Latin-1 characters.  They should come from a
single Unicode script, to avoid burdensome Unicode puns.  It should
be a raster script that does not require decades of handwriting practice.

Cyrillic fits the bill very well:  it's recognizable and out-of-band.  You'd
never type these URI keywords in, but Cyrillic is a software-selectable
keyboard so you could if you felt like it.

  http://www.cs.yale.edu/~perlis-alan/quotes.html
  httpцЛЛwwwЯcsЯyaleЯeduЛжperlis-alanЛquotesЯhtml

Cyrillic might serve us well, but maybe there is a set of dedicated substitution characters in Unicode?

I'm still concerned that doing this might be viewed as an ugly hack, but I think being able to reuse namespace aliasing is a powerful proposition...

Matching Socks

Sep 17, 2016, 11:17:45 AM
to Clojure
No escape needed; or, rather, no need to invent an escape.  A mapping from IRI to URI is already specified in RFC 3987, "Internationalized Resource Identifiers (IRIs)".  Just translate IRI to URI, and thence to keyword. 
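
On the JVM, one way to approximate that translation is `java.net.URI`, whose `toASCIIString` method percent-encodes non-ASCII characters as UTF-8. This is only a sketch, not a complete RFC 3987 implementation (for instance, it does nothing special for internationalized host names), and the name `iri->uri` is hypothetical.

```clojure
(import '(java.net URI))

;; Hypothetical helper: percent-encode the non-ASCII characters of an IRI.
(defn iri->uri [^String iri]
  (.toASCIIString (URI. iri)))

;; The path segment ends in e-acute (U+00E9).
(iri->uri "http://example.com/caf\u00e9")
```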

Herwig Hochleitner

Sep 24, 2016, 3:27:18 PM
to clo...@googlegroups.com
What about skipping the alphabet translation and just doing uri encoding?

{http://www.w3.org/1999/xhtml}pre
=> :http%3A%2F%2Fwww.w3.org%2F1999%2Fxhtml/pre
doesn't seem so bad and this way we would get uniformity without weird corner cases.
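
A rough sketch of that idea, using `java.net.URLEncoder` and `java.net.URLDecoder` from the JDK. The names `qname->kw` and `kw->qname` are hypothetical and are not data.xml's API. Note also that `URLEncoder` implements form encoding (a space becomes `+`), which is an assumption that happens to be harmless for typical namespace URIs.

```clojure
(import '(java.net URLEncoder URLDecoder))

;; Hypothetical helper: {uri}local  ->  :encoded-uri/local
(defn qname->kw [^String uri ^String local]
  (keyword (URLEncoder/encode uri "UTF-8") local))

;; Hypothetical helper: the reverse, returning [uri local].
(defn kw->qname [kw]
  [(URLDecoder/decode ^String (namespace kw) "UTF-8") (name kw)])

(qname->kw "http://www.w3.org/1999/xhtml" "pre")
```

The keyword is built with `keyword` rather than written as a literal, so the sketch does not depend on how the reader treats `%` inside tokens.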

Herwig Hochleitner

Sep 28, 2016, 9:09:07 PM
to clo...@googlegroups.com
So, your comment about using uri-encoding inspired me to just use that as an encoding to fit in a kw-ns. It seems to work out: https://github.com/bendlas/data.xml/commit/22cbe21181175d302c884b4ec9162bd5ebf336d7

There are a couple of open issues that I commented on in the commit.
I'll open a dev-thread about the possibility of making clojure.core/alias auto-creating, with varargs, and exposing it as (ns (:alias al nnnnn ak mmmm)). That would make this incarnation of data.xml very convenient to use, as well as solve similar cases of creating a namespace just for the sake of naming keywords in it.
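
Until something like that exists in core, the behaviour can be approximated in user code. The following is only a sketch of the idea: `alias+` is a hypothetical name, not an existing or proposed core function, and it simply creates each namespace before aliasing it in the current namespace.

```clojure
;; Hypothetical helper: takes alternating alias / namespace symbols,
;; creates each namespace if needed, then aliases it in *ns*.
(defn alias+ [& alias-ns-pairs]
  (doseq [[a n] (partition 2 alias-ns-pairs)]
    (create-ns n)
    (alias a n)))

(alias+ 'al 'nnnnn
        'ak 'mmmm)
```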

Herwig Hochleitner

Sep 28, 2016, 10:15:54 PM
to clo...@googlegroups.com