Hello Everyone,

I'm writing a little WebDAV frontend for a custom storage server, so I decided to bring data.xml's namespaced XML support up to snuff.

There is a ticket and design page for this:

However, I've taken a fresh approach which, I think, will fit well with how things are done in Clojure. Namespaced XML is currently the sore thumb of Clojure's data support and, as you will see, it's easily fixed if we manage to agree on an interface. In addition to feedback on my current work, I'm soliciting proposals on which API we want _specifically_ for namespaced XML (Step 3), because the rest is pretty much reasoned from design constraints.

I'll summarize my current work (https://github.com/bendlas/data.xml) here and, if people think it's the right direction, I'll update the design page and finish up.

Step 1: Getting namespaced XML to roundtrip properly
-----------
This is done by just mapping keyword namespaces <-> xmlns prefixes. With that it's possible to emit (or parse) broken XML with bogus namespaces, and you have complete control over the generated prefixes. This is pretty much the current interface, bugfixed, and I've already attached a patch for this to DXML-4.

e.g.

<element xmlns:A="AURI:"><A:foo B:attr="..." /></element>
<-->
{:tag :element :attrs {:xmlns/A "AURI:"} :content [{:tag :A/foo :attrs {:B/attr "..."}}]}

Note how this is already sufficient to do any kind of namespaced XML processing, and it's a great baseline representation because of full roundtripability, but it requires consumers to resolve prefixes.
So we need another representation, where names are qualified by URI.

Step 2: Representing uri-namespaced names directly
----------
Current work uses a custom defrecord XmlName as the data structure, but javax.xml.namespace.QName seems fully appropriate, so I will use that as of the next revision.
URI-namespaced names can be used in place of keywords in an xml tree. For emitting, appropriate xmlns* declarations have to be in place in order to assign a prefix to the name.
When parsing, keywords can be replaced with those names, either by a parser option or by a separate tree walker that keeps track of xmlns* attributes.
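To make the tree-walker idea concrete, here is a minimal sketch. It assumes the Step 1 map representation, represents a resolved name as a plain [uri local-name] pair rather than a QName, and drops the xmlns* attributes from the result. The names scope-with, resolve-kw and walk-resolve are made up for this illustration and are not the helpers from the fork.

```clojure
(defn scope-with
  "Extend the prefix->uri map `env` with the xmlns declarations in `attrs`."
  [env attrs]
  (reduce-kv (fn [env k v]
               (cond
                 (= k :xmlns)              (assoc env "" v)       ; default namespace
                 (= "xmlns" (namespace k)) (assoc env (name k) v) ; xmlns:prefix
                 :else                     env))
             env attrs))

(defn resolve-kw
  "Resolve keyword `k` to a [uri local-name] pair using `env`.
  Unprefixed names get `default-uri` (nil meaning: no namespace)."
  [env default-uri k]
  (if-let [prefix (namespace k)]
    (if-let [uri (get env prefix)]
      [uri (name k)]
      (throw (ex-info "Unbound prefix" {:prefix prefix :name k})))
    [default-uri (name k)]))

(defn walk-resolve
  "Replace prefixed keywords in element `el` with [uri local-name] pairs,
  tracking xmlns declarations on the way down. Content stays lazy."
  ([el] (walk-resolve {} el))
  ([env el]
   (if (map? el)
     (let [env   (scope-with env (:attrs el))
           attrs (into {}
                       (for [[k v] (:attrs el)
                             :when (not (or (= k :xmlns)
                                            (= "xmlns" (namespace k))))]
                         ;; unprefixed attributes stay unqualified
                         [(resolve-kw env nil k) v]))]
       (assoc el
              :tag     (resolve-kw env (get env "") (:tag el))
              :attrs   attrs
              :content (map #(walk-resolve env %) (:content el))))
     el))) ; strings and other leaf content pass through

;; Two spellings of the same element resolve to equal trees:
(def doc-a {:tag :a/foo :attrs {:xmlns/a "urn:x" :id "1"} :content ["hi"]})
(def doc-b {:tag :foo   :attrs {:xmlns "urn:x" :id "1"}   :content ["hi"]})
(= (walk-resolve doc-a) (walk-resolve doc-b)) ; true
```

Dropping the declarations is only one possible choice; a roundtripping variant would have to keep them somewhere, for example in metadata.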
Step 3: Helpers
-----------
Currently, I've implemented the following helpers for namespaced XML:

- resolve-name: generate uri-namespaced name from prefixed name
- walk-resolve-names: the tree walker mentioned in Step 2
- walk-cleanup-prefixes: remove redundant prefixes (multiple bound to same uri)
- with-xmlns (macro): syntactically replace namespaced keywords with xml-names in source code

The primary purpose of with-xmlns is to denote xml-names in query functions et al. To generate straight namespaced xml fragments, just declare appropriate xmlns* attributes on them.

More possibilities:

- find the minimal set of necessary xmlns uris to declare on a root element, in order to be able to emit the whole fragment. This will be non-lazy, since it adds a pass before actually emitting. For large xml data, it's recommended to statically add possible xmlns* attributes and emit lazily with walk-cleanup-prefixes.
- an xml zipper giving access to the namespace environment at each location

Thanks for taking the time to review this; let's make Clojure the best choice for XML processing as well!

cheers
--
You received this message because you are subscribed to the Google Groups "Clojure Dev" group.
--
You are saying that consumers need to resolve prefixes, in the sense that, given a position in the XML tree and a namespaced keyword (i.e. :xmlns/A), they have to work out what it resolves to at this position in the document? Is there no document-level map for this resolution? I'm mostly trying to figure out how you're dealing with the same prefix defined multiple times (with different URIs) in a document.
I like this. So the user will be able to opt in to getting full URIs rather than the namespaced keywords from the "parser". Although looking at the URIs is a bit ugly, I think it's the only way we can support both streaming and namespaces.
> - resolve-name: generate uri-namespaced name from prefixed name
Maybe this is the function that answers my question from step 1?
How does emitting XML that contains namespace work now in your fork?
While I have not looked at the code much, the words in Step 2 make me ask whether there should be a QualifiedName protocol that is extended to QName instead of a specific concrete type?
In Step 3, hearing tree walkers sounds like something that precludes streaming, which is something the current impl can do (and is really important). To what extent has that been affected for non-namespaced XML and to what extent is it now possible with namespaced XML?
Unfortunately for all of you, that
means I have opinions to impose on the conversation.
First, I very much like the idea of using the fully qualified URI in
tag names (and sometimes in attributes as well, but I'll come to that
in a sec). One key benefit is that a fully qualified node can be
lifted out of one XML snippet and dropped into another without
becoming undefined or incorrect. Another way of saying this is that
any two elements or trees of elements in Clojure would give the same
answer to Clojure's = as xpath's deep-equal function. The deep-equal
function (http://www.w3.org/TR/2005/CR-xpath-functions-20051103/#func-deep-equal)
is the most widely used sense of XML equality, and I think matching
that is a valuable property.
So, as soon as a code base starts manipulating any XML data that uses
just xmlns aliases (prefixes) without the full URI being attached, it
now has two *different kinds* of XML data that will behave
differently, respond to equality comparisons differently, etc. Note
that the "lifting" I referred to in the previous paragraph can be as
innocent an operation as taking an XML element and returning one of
its child-elements. So I'm nervous about *ever* producing aliased
(prefixed) tag names. If we must do so, I think there should be big
warnings in the docstrings of the related functions about their
unsuitability for use in various circumstances.
Attributes are a little different in that they live in a map and a
MapEntry is not a normal thing to pass around in an application.
Combined with the fact that the namespace of an attribute is almost
always the same as that of its element, perhaps the namespace of
attributes can be elided except when they differ from their element.
If this is done consistently, then we still have a canonical format
such that Clojure equality will match xpath equality.
Now for the element and attribute names themselves--why not continue to
use keywords? :http://foo.bar/namespace/tag is a valid Clojure
keyword and already has reader and printer support. Perhaps more
importantly, keywords already have alias support, so that we can use
Clojure's existing reader aliases to say ::foo/tag *anywhere* instead
of having to wrap uses of aliases in a call to a macro like
with-xmlns. This would also obviate the need for a new protocol of the
sort Alex suggested (XmlName). It is unfortunate that the original
alias or prefix supplied by a parsed document can't be hung off the
keyword itself, since keywords don't support metadata, but I think metadata
on the element object itself would be sufficient to collect all the
relevant namespace prefixes used, to support round-tripped XML reusing
the same aliases as the original document. What do you think?
I totally second Chouser here, and that's why the mapping between uris and attributes was put in metadata in http://dev.clojure.org/display/DXML/Fuller+XML+support (please see my recent comment, as I recently tried to implement the proposal and hit a limitation).
Attributes are even more special: unlike elements, non-prefixed attributes do not resolve to the default ns or even to their element's ns: they should stay unqualified.
I haven't checked recently (I have vague memories of having looked this up ages ago but don't remember the conclusion): are all URIs (IRIs would be even cooler) readable as namespaces in a keyword?
Right now it's quite an exercise: you have to create a namespace with an unreadable symbol just to be able to create the alias (funny that you recently asked for being able to create aliases without namespaces :-))
Could you detail on why you thought the namespace info needs to live in metadata?
Concerning the case `multiple prefix <-> same uri`, my implementation uses the *outermost* prefix for a given uri when emitting a *resolved* name. Other modes could be supported. Can you tell me of a use case where one would want to emit a resolved name and still control the prefix on a per-name basis?
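As an illustration of the "outermost prefix wins" rule, here is a rough sketch. It assumes the emitter keeps the xmlns declarations as a vector of prefix->uri maps, ordered from the root element to the current one. prefix-for is a hypothetical name, not a function from the fork, and the shadowing check is my own assumption about what a correct emitter would need.

```clojure
(defn prefix-for
  "Pick the outermost usable prefix for `uri`. `scopes` is a vector of
  prefix->uri maps, root element first, current element last."
  [scopes uri]
  (let [in-scope (apply merge {} scopes)] ; inner declarations shadow outer ones
    (first (for [scope scopes
                 [p u] scope
                 ;; skip prefixes that were rebound to another uri further in
                 :when (and (= u uri) (= uri (get in-scope p)))]
             p))))

(prefix-for [{"a" "urn:x"} {"b" "urn:x"}] "urn:x")             ; "a"
(prefix-for [{"a" "urn:x"} {"a" "urn:y" "b" "urn:x"}] "urn:x") ; "b"
(prefix-for [{"a" "urn:x"}] "urn:z")                           ; nil
```

Returning nil for an undeclared uri leaves it to the caller to decide whether to generate a prefix or fail.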
It's even worse: if the qualified name http://my-ns/foo/name will be represented as (keyword "http://my-ns/foo" "name"), then what is the qualified name DAV:propfind going to be? (keyword "DAV" "propfind")? (keyword "DAV:" "propfind")?
--Chouser
It's an expedient way to get them out of the equality scope :-) To me a namespace-aware tool should ignore aliases and xmlns attributes, focusing only on resolved names.
The two tiers of your model are the representation tier and the model tier -- and only the model tier is (and should be) namespace aware. It should be made clear (through api more than through doc) to users that the representation tier should be avoided, since you can emit broken XML.
Unused xmlns attributes do not affect equality (just like metadata).
In fact I don't think any xmlns attributes affect equality, except to
the extent that they're used to set the namespaces of other things.
Broken consumers working at the representation level :-) The "Fuller XML Support" proposal tried hard (too hard?) to get by using only default data structures. However something like QName solves most (all?) issues I had (the alias is part of the name but not used for equality).
Additionally a reader tag may be introduced.
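The QName property mentioned above (the alias travels with the name but does not take part in equality) can be checked at the REPL. This is only a small sketch using javax.xml.namespace.QName from the JDK; the URI and prefixes are taken from the examples in this thread.

```clojure
(import 'javax.xml.namespace.QName)

;; Same namespace URI and local part, different prefixes.
(def q1 (QName. "http://my-ns/foo" "name" "a"))
(def q2 (QName. "http://my-ns/foo" "name" "b"))

(= q1 q2)       ; true, equals only compares URI and local part
(.getPrefix q1) ; "a", the prefix is still available for emitting
(get {q1 :found} q2) ; :found, so QNames work as map keys across prefixes
```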
First thing: a:name, with a mapped to http://my-ns/foo, DOES NOT resolve to http://my-ns/foo/name; it resolves to the pair [http://my-ns/foo name]. But using / to separate namespaces from names adds more confusion.
=> :http://example.org/?stupid#test/name
:http://example.org/?stupid#test/name
=> [(namespace *1) (name *1)]
["http://example.org/?stupid#test" "name"]
DAV:propfind would just be :DAV/propfind -- no ambiguity: DAV has no scheme hence it's not a URI. (I'm still uneasy about shoehorning URIs in namespaces)
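For what it's worth, when such keywords are built programmatically rather than read, the namespace and the name are stored as a pair, so nothing is lost whichever spelling we pick. A small sketch, using one of the candidate spellings from the earlier question:

```clojure
;; The two-argument keyword constructor keeps namespace and name apart.
(def uri-name (keyword "http://my-ns/foo" "name"))
(def dav-name (keyword "DAV:" "propfind"))

[(namespace uri-name) (name uri-name)] ; ["http://my-ns/foo" "name"]
[(namespace dav-name) (name dav-name)] ; ["DAV:" "propfind"]
```

Whether every such keyword can also be printed and read back is a separate question, and this sketch does not answer it.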
I really think a tagged literal with a custom (implementing Named) or existing (QName) type which embeds the alias without using it for equality may be a sensible middle ground.
I'm still concerned about the loss of composability when
some xml functions take or return one kind of xml and others another,
and the potential for confusion when these are sometimes but not
always compatible.
I think it's pretty common for a function to return an xml element or
fragment that is not at the root of the final document, and we should
strive to make it convenient for such code to be explicit about
namespaces rather than encourage it to rely on aliases provided in
some other lexical scope.
I'm not sure if I'm disagreeing with you on that point or not. :-)
It would be unfortunate for every literal element name in Clojure
source to require a full uri. It would also be unfortunate for any
alias or prefix that's used instead to be resolved in some kind of
reverse dynamic scope, depending on the xmlns maps set up in some xml
document where the literal at hand will eventually be embedded.
This leaves, I think, (1) keywords using standard Clojure ns aliases, (2)
wrapping every instance of a literal element name in a macro that
expands custom aliases, or (3) a tagged literal that is picking up on
some kind of alias-map state somewhere if such is possible.
> Prefixes should be controlled by setting regular xmlns attributes into the
> tree and the user should use :prefix/names outside of a lexically
> enclosing :xmlns* at his own peril.
This is a great peril and I hope we can find a way to make it very
easy to avoid.
So even if we don't use keywords for element names, we can still use
Clojure's regular alias mechanism. A tagged literal reader has
access to *ns* and its alias map. This would allow something like:
(alias 'html "http://long-url-to-my-ns") ;; hypothetical future alias fn
Then later in the same namespace,
#xml/name html/h1
could be read as {:uri "http://long-url-to-my-ns" :prefix "html" :name
"h1"} or whatever canonical representation we end up with.
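A rough sketch of such a reader function follows. Since an alias fn that accepts a URI is hypothetical, this version looks the prefix up in an explicit table instead of the alias map of *ns*; xmlns-aliases and read-xml-name are names invented for the example.

```clojure
(def xmlns-aliases
  "Stand-in for the alias map: prefix -> namespace URI (hypothetical)."
  {"html" "http://long-url-to-my-ns"})

(defn read-xml-name
  "Data reader for #xml/name: turns the symbol prefix/local into a map."
  [sym]
  (let [prefix (namespace sym)
        uri    (get xmlns-aliases prefix)]
    (when-not uri
      (throw (ex-info "No alias for prefix" {:prefix prefix :symbol sym})))
    {:uri uri :prefix prefix :name (name sym)}))

(def h1
  (binding [*data-readers* {'xml/name read-xml-name}]
    (read-string "#xml/name html/h1")))

h1 ; {:uri "http://long-url-to-my-ns", :prefix "html", :name "h1"}
```

Because the lookup happens at read time, the resulting value is already fully resolved by the time any XML function sees it.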
Christophe, you are talking about the XPath model here, which specifically ignores xmlns declarations. The XML Infoset (kind of the uber spec) counts xmlns attributes as regular attribute information items. (Yes, I've read up a bit, thanks chouser for giving me the push).
[attributes] An unordered set of attribute information items, one for each of the attributes (specified or defaulted from the DTD) of this element. Namespace declarations do not appear in this set. If the element has no attributes, this set has no members.
[namespace attributes] An unordered set of attribute information items, one for each of the namespace declarations (specified or defaulted from the DTD) of this element. Declarations of the form xmlns="" and xmlns:name="", which undeclare the default namespace and prefixes respectively, count as namespace declarations. Prefix undeclaration was added in Namespaces in XML 1.1. By definition, all namespace attributes (including those named xmlns, whose [prefix] property has no value) have a namespace URI of http://www.w3.org/2000/xmlns/. If the element has no namespace declarations, this set has no members.
--
What do you think of using namespaced keywords in place of :in-scope
and :namespace-attrs? :clojure.data.xml/in-scope, for example. My
thinking here is that these are not going to be typed out by users very
often, and others might have reason to add metadata that we wouldn't
want to collide with.
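To illustrate, here is a tiny sketch of what that could look like. The key :clojure.data.xml/in-scope is the one proposed above; the element maps and URIs are made up.

```clojure
;; Same element, different namespace environments recorded in metadata.
(def a (with-meta {:tag :foo :attrs {} :content []}
         {:clojure.data.xml/in-scope {"a" "urn:x"}}))
(def b (with-meta {:tag :foo :attrs {} :content []}
         {:clojure.data.xml/in-scope {"b" "urn:x"}}))

(= a b)                               ; true, metadata is ignored by =
(:clojure.data.xml/in-scope (meta a)) ; {"a" "urn:x"}
```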
I second this argument.
Now, while your description of how to use aliases from other Clojure
namespaces helps illustrate how the mechanism actually works, I think
we'd generally want to discourage this. Other aliases in Clojure are
private to the namespace declaring them, not part of their public API.
The full URI is the public interface for XML tags, and having each
namespace spend a line to declare its own alias would surely be worth
the decoupling that it buys.
I think your security concerns are interesting as well, and a wise
thing to consider early.
So to make sure I understand, we're talking
about untrusted data that may have once been a string (like
user-entered form data) but which has now been edn/read and is Clojure
data, right?
Then providing that as input to a resolving function
could return XML elements of various namespaces. But wouldn't such edn
data be able to specify any namespace it wants anyway?
One way to help plug any security hole that may exist there would be
to only do the alias resolution in the context of the #xml/name reader
macro, and not provide that to your edn reader when reading untrusted
data. When used this way, they could be called "literal resolved
names" or "tagged resolved names" or something, since they are fully
qualified resolved QNames by the time the reader is done with them.
They are much safer to use than the fully-raw names, so "pseudo-raw"
sounds perhaps scarier than necessary.
What bothers me is that the distinction between raw and pseudo-raw names is not static: you can't know whether :x/foo is raw or pseudo-raw without looking up the namespace and then the mapping in the ns. Plus, raw name prefixes may collide with single-segment (clojure) namespaces.
--