[xml3k Trac] #48: Complete implementation of filter-based parser capabilities


akar...@xml3k.org

Jan 30, 2010, 9:09:43 PM
to akar...@googlegroups.com
#48: Complete implementation of filter-based parser capabilities
----------------------------------------+-----------------------------------
Reporter: http://uche.myopenid.com/ | Owner: David Beazley <dave@…>
Type: defect | Status: new
Priority: major | Milestone: safe harbor
Component: amara | Version: 2.0
Keywords: parser, xml, expat, filter |
----------------------------------------+-----------------------------------
Amara was designed to enable a variety of facilities in streaming mode,
for performance. This requires some gymnastics dealing with the
[[http://expat.sourceforge.net/|expat]] parser (see the unfortunately huge
module http://bitbucket.org/uche/amara/src/tip/lib/src/expat/expat.c ) in
order to suspend and resume it to allow Amara to perform concurrent
processing.

In general, this processing involves a
[[http://wiki.xml3k.org/Amara2/Architecture/Streaming_XPath|streamable
XPath subset]], so that, for example, a user could register a pattern
"/a/b[@attr='1']" which would process each b child of the root a element if
that child had an attribute with name "attr" and value "1".
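
As a rough illustration of what matching such a pattern in streaming mode involves, here is a minimal sketch using only the standard library's `xml.parsers.expat`. It is not Amara's API (which this ticket has yet to define); the function name `match_b_with_attr` and the hard-coded pattern are assumptions made for the example.

```python
import xml.parsers.expat

def match_b_with_attr(xml_text):
    """Return the attribute dict of every /a/b element whose attr == '1'."""
    stack = []      # names of the currently open elements
    matches = []

    def start(name, attrs):
        stack.append(name)
        # State check: we are exactly on the path /a/b (depth 2).
        if stack == ["a", "b"] and attrs.get("attr") == "1":
            matches.append(dict(attrs))

    def end(name):
        stack.pop()

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return matches

DOC = '<a><b attr="1" id="x"/><b attr="2" id="y"/><c><b attr="1" id="z"/></c></a>'
print(match_b_with_attr(DOC))
```

Only the first b is reported: the second fails the attribute test and the third is not a child of the root. Nothing beyond the stack of open element names is kept in memory, which is the point of restricting patterns to a streamable subset.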

The most generally useful instantiation of this idea is pushbind, where
Amara yields small document subtrees according to declared patterns. See
the [[http://wiki.xml3k.org/Amara/Manual#pushbind|documentation of
Pushbind for Amara 1]]. Pushbind is not yet implemented for Amara 2, and
doing so should be a key outcome of this ticket.

Amara 1.x was not closely integrated with the expat C parser, so some
pretty inefficient threading was required to send back the subtrees once
ready.

In Amara 2.x the low level expat handler now has the wherewithal to
suspend and resume at set boundaries of the input document, but this is
not yet hooked up to the node building machinery, and also does not
implement the state machine required for the XPath streamable subset
patterns.

Note: The main difference between Pushbind and ElementTree iterparse is
that the latter primarily sends events, and gives you the hooks to build
subtrees. Pushbind instead triggers on document patterns to automate the
subtree building for the user, which is easier to use and covers most real
cases I've seen.
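
To make the contrast concrete, here is a small sketch of a pushbind-like helper layered on ElementTree's `iterparse`. The helper name `iter_subtrees` is hypothetical, and it matches on a bare element name only, which is far less than the streamable XPath subset this ticket is about.

```python
import io
import xml.etree.ElementTree as ET

def iter_subtrees(source, tag):
    """Yield each completed element named `tag`, then free its content."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            elem.clear()  # drop children and text once the caller moves on

XML = b"<doc><one><a>0</a><a>1</a></one><two><a>10</a><a>11</a></two></doc>"
texts = [a.text for a in iter_subtrees(io.BytesIO(XML), "a")]
print(texts)
```

With plain `iterparse` the caller writes the loop, the test and the cleanup; a pattern-triggered interface hides all three behind the declared pattern.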

--
Ticket URL: <http://trac.xml3k.org/ticket/48>
xml3k <www.xml3k.org>
XML3K is a loose collective of open-source projects focused on Python, XML, and RESTful design.

akar...@xml3k.org

Jan 30, 2010, 9:31:31 PM
to akar...@googlegroups.com
#48: Complete implementation of pushbind and related filter-based parser
capabilities
Description changed by http://uche.myopenid.com/:

Old description:
New description:
In addressing this issue the relevant code needs much better
documentation, and possibly other improvements. See related, catch-all
issue #22

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:1>

akar...@xml3k.org

Mar 4, 2010, 9:34:17 AM
to akar...@googlegroups.com
#48: Complete implementation of pushbind and related filter-based parser
capabilities
Changes (by http://uche.myopenid.com/):

* cc: Andrew, Dalke, <dalke@…> (added)


Comment:

[http://groups.google.com/group/akara/msg/78d294030e596476?hl=en Rocco
Pigneri had some suggestions] made on the list. Quoting here:

> Here's a list of features that I would like to see in pushbind 2.0 that
I think would greatly increase its usefulness. These are in decreasing
order of importance:

> - It would be great if Amara automatically provided the XPath that
matched the node as an attribute on all returned nodes. That's really
what has started my own investigation into pushbind.

Yes, we discussed this yesterday. It's bitten me, too. In effect Amara
1.x pushbind would lose the context of the pattern within which each chunk
was matched. So in the following case:


{{{
>>> XML="""\
... <doc>
... <one><a>0</a><a>1</a></one>
... <two><a>10</a><a>11</a></two>
... </doc>
... """
>>> import amara
>>> chunks = amara.pushbind(XML, u'a')
>>> a = chunks.next()
>>> print a
0
>>> print a.xml()
<a>0</a>
>>> a = chunks.next()
>>> print a.xml()
<a>1</a>
>>> a = chunks.next()
>>> print a.xml()
<a>10</a>
}}}

You lose the fact that the first two matches were in context /doc/one, and
that the third was in context /doc/two.

There are multiple ways to deal with this. One is to have bindery provide
the context in XPath form for each pushbind chunk, perhaps as a data
member on the instance, e.g. a.xml_pushbind_context or something.
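
A minimal sketch of this first option, using the standard library's ElementTree rather than bindery, is to hand back the context path alongside each chunk. The name `iter_with_context` is hypothetical, and the context is returned as a separate string instead of being set on the node.

```python
import io
import xml.etree.ElementTree as ET

def iter_with_context(source, tag):
    """Yield (context_path, element) for each completed element named `tag`."""
    path = []  # names of the elements currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
        else:
            path.pop()
            if elem.tag == tag:
                # `path` now holds only the ancestors of the match.
                yield "/" + "/".join(path), elem

XML = b"<doc><one><a>0</a><a>1</a></one><two><a>10</a><a>11</a></two></doc>"
results = [(ctx, a.text) for ctx, a in iter_with_context(io.BytesIO(XML), "a")]
print(results)
```

With the document from the session above, the first two chunks come back with context /doc/one and the last two with /doc/two.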

Another way is to send back the pushbind chunks in a subtree that includes
all ancestors (but no siblings or grand-siblings, of course). One nice
thing about this is that you can have access to e.g. attributes on
ancestor elements. This latter approach is what we were discussing
yesterday.

Biggest problem with that is control. What happens if the user mutates up
the tree? I would suggest that this is documented as something the user
should not do. We could just leave it so that a user ignoring that
warning could break that pushbind sequence. Or we could just always
discard and re-create each entire subtree each time, which would be
wasteful in that ancestor nodes would be recycled over and over.

David, I'm leaning to just treating the user as a consenting adult. We'll
warn them that if they mutate any ancestors they could break things, but
let's not recycle ancestor nodes each time.
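
For comparison, here is a sketch of the ancestors-included approach, again on top of ElementTree rather than the Amara 2 builder. For simplicity it takes the "discard and re-create" variant, which is the one described above as wasteful, so it illustrates the shape of the result and not the intended implementation. The name `iter_with_ancestors` is hypothetical.

```python
import copy
import io
import xml.etree.ElementTree as ET

def iter_with_ancestors(source, tag):
    """Yield (root, match): a fresh ancestor chain holding only the match."""
    open_elems = []  # elements currently open, outermost first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            open_elems.append(elem)
            continue
        open_elems.pop()
        if elem.tag != tag:
            continue
        # Rebuild the ancestors (tag and attributes only) for this match.
        root = parent = None
        for anc in open_elems:
            node = ET.Element(anc.tag, dict(anc.attrib))
            if parent is None:
                root = node
            else:
                parent.append(node)
            parent = node
        match = copy.deepcopy(elem)
        match.tail = None
        if parent is None:
            root = match
        else:
            parent.append(match)
        yield root, match

XML = b'<doc><one kind="x"><a>0</a><a>1</a></one><two kind="y"><a>10</a></two></doc>'
out = [ET.tostring(root, encoding="unicode")
       for root, _ in iter_with_ancestors(io.BytesIO(XML), "a")]
print(out)
```

Because each result gets its own ancestor chain, a user mutating the ancestors cannot disturb later matches; the cost is one new node per ancestor per match.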

> - Attributes should be treated equally to elements, especially for the
purposes of rules. It's a little silly (at least, as far as I can
understand) that I can associate rules with elements but not attributes.

This has always been more a matter of implementation convenience than a
fundamental design decision. We do intend to try to keep attributes
first-class this time, if we can.

One thing that would help from you is use-cases. Could you add to this
ticket some use-cases of rules and/or pushbind examples of things you've
wanted to do with attributes, but were prevented in Amara 1.x?

> - It might be nice to add a Sax.START_ATTRIBUTE/END_ATTRIBUTE event to
pushbind's parser in order to separate the element events from the
attribute events.

pushbind 2.x will be using the low-level tree builder, so this time it
will have access to the attribute "named node set" constructs provided by
the parser.

> I could get into some detail about implementing them as well, but I'll
wait to see if there's interest.

Well, considering the implementation is changing radically, you might want
to wait a bit. Also, how's your Python/C programming? :D

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:2>

akar...@xml3k.org

Mar 5, 2010, 1:15:11 PM
to akar...@googlegroups.com
#48: Complete implementation of pushbind and related filter-based parser
capabilities

Comment(by https://www.google.com/accounts/o8/id?id=aitoawnfdtjjvsvjiec9huynk-adjuq1mgt39uc):

Hello, all!

I'm new to Trac so please forgive my newbie foibles :-).

Replying to [comment:2 http://uche.myopenid.com/]:

> > - It would be great if Amara automatically provided the XPath that
matched the node as an attribute on all returned nodes. That's really
what has started my own investigation into pushbind.
>
> Yes, we discussed this yesterday. It's bitten me, too. In effect Amara
1.x pushbind would lose the context of the pattern within which each chunk
was matched.
>

> <snip>

>
> There are multiple ways to deal with this. One is to have bindery
provide the context in XPath form for each pushbind chunk, perhaps as a
data member on the instance, e.g. a.xml_pushbind_context or something

I am unfamiliar with what the interface for xml_pushbind_context would be
--I feel the above sentence implies that such an interface already exists
--but I think that this approach would require that the user learn a new
interface to understand the node's context. It could be difficult for
users to match a fragment with the XPath given to pushbind. I'm thinking
specifically of how you would match up
/root/element[@id='books']/subelement given that it could (technically) be
re-written as /root/element/subelement.


> Another way is to send back the pushbind chunks in a subtree that
includes all ancestors (but no siblings or grand-siblings, of course).
One nice thing about this is that you can have access to e.g. attributes
on ancestor elements. This latter approach is what we were discussing
yesterday.

What I really like about this approach is that it sounds like I could, for
free, do a frag.root.xml_xpath(xpath) match up to see which of the XPaths
that I passed into pushbind match the current fragment that I have been
given. This way, I can easily find out which chunk type I have. It also
wouldn't require the user to learn a new interface at all since the
returned fragment node and the parent node would use the same interface as
the other fragments.
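
A sketch of that match-up, assuming only the standard library's ElementTree (whose `findall` supports a limited XPath subset with paths relative to the root). The helper `matching_patterns` is hypothetical and is not part of Amara.

```python
import xml.etree.ElementTree as ET

def matching_patterns(root, fragment, patterns):
    """Return the patterns whose result set, evaluated at `root`, contains `fragment`."""
    return [p for p in patterns
            if any(node is fragment for node in root.findall(p))]

root = ET.fromstring(
    "<root><element id='books'><subelement/></element>"
    "<element id='music'><subelement/></element></root>"
)
patterns = ["./element[@id='books']/subelement", "./element/subelement", "./other"]
first = root.find("./element[@id='books']/subelement")
second = root.find("./element[@id='music']/subelement")
print(matching_patterns(root, first, patterns))
print(matching_patterns(root, second, patterns))
```

The first fragment is selected by both the predicated and the unpredicated pattern, which is the same ambiguity noted earlier for a bare context path; the second is selected by the unpredicated pattern only.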


> Biggest problem with that is control. What happens if the user mutates
up the tree? I would suggest that this is documented as something the
user should not do. We could just leave it so that a user ignoring that
warning could break that pushbind sequence. Or we could just always
discard and re-create each entire subtree each time, which would be
wasteful in that ancestor nodes would be recycled over and over.
>
> David, I'm leaning to just treating the user as a consenting adult.
We'll warn them that if they mutate any ancestors they could break things,
but let's not recycle ancestor nodes each time.

I'm fine with treating the users as consenting adults. After all, we
aren't building an end-user tool, now are we ;-)?


> > - Attributes should be treated equally to elements, especially for the
purposes of rules. It's a little silly (at least, as far as I can
understand) that I can associate rules with elements but not attributes.
>

> <snip>

>
> One thing that would help from you is use-cases. Could you add to this
ticket some use-cases of rules and/or pushbind examples of things you've
wanted to do with attributes, but were prevented in Amara 1.x?

The biggest one that I can think of involves the matching stated above.
If I pass in ['/root/element@id', '/root/element/children'] to pushbind, I
then wanted to have a rule for each XPath that would associate all matched
fragments with the XPath that produced them. In other words, if a node
matched '/root/element@id', then that string would be added to that
document fragment before pushbind returned it.

I will add other use cases as I run into them. I am using Amara rather
extensively this week so I'm sure that I'll run into at least one more use
case ;-).

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:3>

Uche Ogbuji

Mar 5, 2010, 3:54:51 PM
to Luis Miguel Morillas, akar...@googlegroups.com
Hi Luis,

On Fri, Mar 5, 2010 at 11:15 AM, <akar...@xml3k.org> wrote:
Comment(by https://www.google.com/accounts/o8/id?id=aitoawnfdtjjvsvjiec9huynk-adjuq1mgt39uc):

Any thoughts about getting a friendly user name/e-mail address from OpenID metadata, so we're not stuck with ugly URLs such as the above?

Thanks.

--
Uche Ogbuji                       http://uche.ogbuji.net
Founding Partner, Zepheira        http://zepheira.com
Linked-in profile: http://www.linkedin.com/in/ucheogbuji
Articles: http://uche.ogbuji.net/tech/publications/
TNB: http://www.thenervousbreakdown.com/author/uogbuji/
Friendfeed: http://friendfeed.com/uche
Twitter: http://twitter.com/uogbuji
http://www.google.com/profiles/uche.ogbuji

akar...@xml3k.org

Mar 5, 2010, 8:07:09 PM
to akar...@googlegroups.com
#48: Complete implementation of pushbind and related filter-based parser
capabilities

Comment(by https://www.google.com/accounts/o8/id?id=aitoawnfdtjjvsvjiec9huynk-adjuq1mgt39uc):

One thing I forgot to add: it would also be great if, in Amara 2, the user
could add custom pushbind rules without needing to know pushbind's
internals intimately ;-). Here are some examples of what I think would be
useful:

1. Allow rules to match both elements and attributes.
2. Allow rules to operate on a set of elements that is orthogonal (i.e.
different) from the ones that will match pushbind. More on this later.
3. Refactor the event model such that the user doesn't have to add,
manage, and remove three different rules in order to process the element.
An implementation that I think would be very clear would be to have rule
classes implement three callbacks, one for each stage of processing the
XML (i.e. start, end, and character).
4. Make it easy for the user to override pushbind and to pass through to
pushbind depending upon whether they want to override or to augment
pushbind's functionality. I think explicit calls here would be most
useful. For example, in order to implement element_skeleton_rule, the
user would explicitly call pushbind in the start and end methods in order
to ensure that the proper XML skeleton nodes are created and wired.
However, in the char method, the user would explicitly not call pushbind
and would simply provide "blank" values in order to ensure that the XML
bodies are not filled in.
5. Allow the user direct access to the object that will become/is/was the
processed node and allow the user both to mutate the object and to replace
it with an object of another type if so desired. In other words, for the
start event, give the user access to the skeleton object that may become
the final output. For end, pass in the final product. Passing in the
binder object is silly (at least, in my experience) because the binder
provides too much access: when I write a rule, I really just want to deal
with the object that the rule matches. Perhaps its children are
interesting as well, but I'm not so sure.
6. No longer expose priority to the user. I think that refactoring the
events as described above will remove most of the need for an externally
available priority.
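
To make point 3 concrete, here is a sketch of a rule class with one callback per stage, driven directly by the standard library's expat binding. The `Rule`, `TextCollector` and `run` names are hypothetical and say nothing about how pushbind dispatches rules internally.

```python
import xml.parsers.expat

class Rule:
    """Hypothetical base class: one object, three callbacks."""
    def start(self, name, attrs): pass
    def end(self, name): pass
    def characters(self, data): pass

class TextCollector(Rule):
    """Collect the text of every element with a given name."""
    def __init__(self, name):
        self.name = name
        self.depth = 0      # > 0 while inside a matching element
        self.buffer = []
        self.results = []

    def start(self, name, attrs):
        if name == self.name:
            self.depth += 1

    def characters(self, data):
        if self.depth:
            self.buffer.append(data)

    def end(self, name):
        if name == self.name:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buffer))
                self.buffer = []

def run(xml_text, rule):
    """Feed the document to expat, routing each event to the rule."""
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = rule.start
    parser.EndElementHandler = rule.end
    parser.CharacterDataHandler = rule.characters
    parser.Parse(xml_text, True)
    return rule

rule = run("<doc><a>0</a><b>skip</b><a>1</a></doc>", TextCollector("a"))
print(rule.results)
```

All state for the element lives on one object, so there is a single thing to register and remove.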

For point two, I find myself thinking of two different use cases. In the
first case, I was thinking of manually changing the XPaths for attributes
from /root/element@attr to /root/element when assigning them to rules and
then modifying @attr as it came down the pike in pushbind. This need
would obviously be moot if point one were implemented. The second case
where I think that point two is useful is where a user wants to augment
the parent nodes of a fragment returned by pushbind. This case is only
useful if we go with the approach above where the "context" elements for
a fragment are implemented as full XML objects.

In summary, these ideas are pretty rough, and there is a lot of detail
that needs to be hashed out. However, the overall functional requirements
are, in my mind, pretty sound and are features that I think would greatly
increase the value of Amara to users.

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:4>

akar...@xml3k.org

Mar 6, 2010, 11:28:51 AM
to akar...@googlegroups.com
#48: Complete implementation of pushbind and related filter-based parser
capabilities

Comment(by http://uche.myopenid.com/):

>>> Rocco
>> Uche
> Rocco


> > > - It would be great if Amara automatically provided the XPath that
matched the node as an attribute on all returned nodes. That's really
what has started my own investigation into pushbind.
> >
> > Yes, we discussed this yesterday. It's bitten me, too. In effect
Amara 1.x pushbind would lose the context of the pattern within which each
chunk was matched.
> >
> > <snip>
> >
> > There are multiple ways to deal with this. One is to have bindery
provide the context in XPath form for each pushbind chunk, perhaps as a
data member on the instance, e.g. a.xml_pushbind_context or something
>
> I am unfamiliar with what the interface for xml_pushbind_context would
be--I feel the above sentence implies that such an interface already
exists

I'm not sure you feel that. I said "perhaps as a...". No such thing
exists yet.



> > Another way is to send back the pushbind chunks in a subtree that
includes all ancestors (but no siblings or grand-siblings, of course).
One nice thing about this is that you can have access to e.g. attributes
on ancestor elements. This latter approach is what we were discussing
yesterday.
>
> What I really like about this approach is that it sounds like I could,
for free, do a frag.root.xml_xpath(xpath) match up to see which of the
XPaths that I passed into pushbind match the current fragment that I have
been given. This way, I can easily find out which chunk type I have.

It would be frag.xml_root.xml_select(xpath) now, but yes.



> > > - Attributes should be treated equally to elements, especially for
the purposes of rules. It's a little silly (at least, as far as I can
understand) that I can associate rules with elements but not attributes.
> >
> > <snip>
> >
> > One thing that would help from you is use-cases. Could you add to
this ticket some use-cases of rules and/or pushbind examples of things
you've wanted to do with attributes, but were prevented in Amara 1.x?
>
> The biggest one that I can think of involves the matching stated above.
If I pass in ['/root/element@id', '/root/element/children'] to pushbind, I
then wanted to have a rule for each XPath that would associate all matched
fragments with the XPath that produced them. In other words, if a node
matched '/root/element@id', then that string would be added to that
document fragment before pushbind returned it.

I don't see why you couldn't do that with an XPath expression within the
fragment context after the match. As long as you have the full, matched
subtree for each match result, as discussed above.

The more I think of it, the more I don't see how it makes sense to support
matches for attribute nodes. Just starting with the implementation
basics, the parser sends attributes all at once, along with the start
element event. So the implementation would have to do some inefficient
slicing and dicing, just to support something for which I can't see a use
case.
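
The delivery of attributes can be seen with the standard library's expat binding (a sketch only; the Amara 2 handler is C code and is not shown here):

```python
import xml.parsers.expat

events = []

def start(name, attrs):
    # All attributes of the element arrive together, in this single call.
    events.append(("start", name, dict(attrs)))

def end(name):
    events.append(("end", name))

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start
parser.EndElementHandler = end
parser.Parse('<a id="123" price="65.25"/>', True)
print(events)
```

There is no separate event per attribute, so an attribute-level match has to be derived from the start-element event of the owner element.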

If you do come up with a use-case, preferably with full, clear pseudo-
code, please let us know.



> One thing I forgot to add: it would also be great if, in Amara 2, the
user could add custom pushbind rules without needing to know pushbind's
internals intimately ;-). Here are some examples of what I think would be
useful:

The user will need to understand the parser/builder model in order to
write their own filters. I don't think we can get around that.



> 2. Allow rules to operate on a set of elements that is orthogonal (i.e.
different) from the ones that will match pushbind. More on this later.

That will come for free now that we are turning pushbind into a peer of
other filters (at least in the sense of applying the filter, not the API
sense) rather than the base layer for the other rules.



> 3. Refactor the event model such that the user doesn't have to add,
manage, and remove three different rules in order to process the element.
An implementation that I think would be very clear would be to have rule
classes implement three callbacks, one for each stage of processing the
XML (i.e. start, end, and character).

The filter will probably have access to the parse events within the
matching subtree, so yes, it would probably be a matter of parser event
call-backs. But there still has to be a way to say e.g. "don't build this
element because I've omitted it", etc. The user will need to learn the
conventions for doing that. There is no way to avoid new conventions for
that.



> 4. Make it easy for the user to override pushbind and to pass through
to pushbind depending upon whether they want to override or to augment
pushbind's functionality. I think explicit calls here would be most
useful. For example, in order to implement element_skeleton_rule, the
user would explicitly call pushbind in the start and end methods in order
to ensure that the proper XML skeleton nodes are created and wired.
However, in the char method, the user would explicitly not call pushbind
and would simply provide "blank" values in order to ensure that the XML
bodies are not filled in.

There is a trade-off here between having one filter have to rely on
another filter, and choreographing state between two filters. We're
probably going to go for the latter approach. Therefore an
element_skeleton_rule filter would not explicitly call pushbind, but would
rather set a state that allows the overall chain of filters to trickle all
the way down to pushbind.

Otherwise you end up with the problem of filter dependency management and
all that. Yuck!
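
A toy version of that choreography, with every class and function name here hypothetical: each filter sees every event plus a shared state dict, and the last filter in the chain consults the state instead of being called by the filter before it.

```python
import xml.parsers.expat

class SkeletonFilter:
    """Hypothetical filter: asks downstream filters to drop character data."""
    def start(self, state, name, attrs):
        state["keep_text"] = False
    def characters(self, state, data): pass
    def end(self, state, name): pass

class BuilderFilter:
    """Hypothetical last filter: records whatever the shared state allows."""
    def __init__(self):
        self.out = []
    def start(self, state, name, attrs):
        self.out.append("<%s>" % name)
    def characters(self, state, data):
        if state.get("keep_text", True):
            self.out.append(data)
    def end(self, state, name):
        self.out.append("</%s>" % name)

def run_chain(xml_text, filters):
    state = {}  # shared by every filter; no filter calls another directly
    parser = xml.parsers.expat.ParserCreate()
    def dispatch(method):
        def handler(*args):
            for f in filters:
                getattr(f, method)(state, *args)
        return handler
    parser.StartElementHandler = dispatch("start")
    parser.CharacterDataHandler = dispatch("characters")
    parser.EndElementHandler = dispatch("end")
    parser.Parse(xml_text, True)

builder = BuilderFilter()
run_chain("<doc><a>0</a><a>1</a></doc>", [SkeletonFilter(), builder])
skeleton = "".join(builder.out)

plain = BuilderFilter()
run_chain("<doc><a>0</a><a>1</a></doc>", [plain])
full = "".join(plain.out)
print(skeleton)
print(full)
```

The builder here is only a string recorder with no escaping or attribute handling; the point is that the skeleton filter and the builder know about the state key, not about each other.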



> 5. Allow the user direct access to the object that will become/is/was
the processed node and allow the user both to mutate the object and to
replace it with an object of another type if so desired. In other words,
for the start event, give the user access to the skeleton object that may
become the final output. For end, pass in the final product. Passing in
the binder object is silly (at least, in my experience) because the binder
provides too much access: when I write a rule, I really just want to deal
with the object that the rule matches. Perhaps its children are
interesting as well, but I'm not so sure.

Again this whole thing is switching to a SAX-like model, so, for example,
a filter will not have access to children of its current context upon
invocation. The filter will just have all the access to normal parser
events.



> 6. No longer expose priority to the user. I think that refactoring the
events as described above will remove most of the need for an externally
available priority.

If you don't have some sort of priority, you rely on the user to order the
filters correctly. It's a tricky matter of defaults versus warnings
versus just documentation. I think only use-cases will determine how to
resolve that.

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:5>

akar...@xml3k.org

Mar 6, 2010, 11:44:17 AM
to akar...@googlegroups.com
#48: Complete implementation of pushbind and related filter-based parser
capabilities

Comment(by http://uche.myopenid.com/):

>> Rocco
> Andrew Dalke (on list)

> I'm the one who'll likely implement the pushbind/xpath code.

Yep. Conclusion from the summit is that Andrew will work on the code to
translate the streamable XPath subset into a state management mechanism
for the parser/builder. David Beazley will work on the parser/builder
system.


>> It would be great if Amara automatically provided the XPath that
matched the node as an attribute on all returned nodes. That’s really
what has started my own investigation into pushbind.

> Could you provide an example of how you would want to use that? I ask
because some things are ambiguous:

> XPath: a//b given <a><a><b/></a></a>

> Does it matter which "<a>" is matched?

> XPath: (*[@name]|a|b) given <a name="blah"/>

> Does it matter which XPath node is matched?

> I would also like to know what information you want from the XPath term?
The relevant XPath substring? The equivalent transformed string? (Eg, if
there is some normalization so that (a[@b]|a) gets converted to a simple
"a".) Or do you want the parsed XPath grammar node?

> Do you want the specific attribute test which passed, in the case of
"*[@a='1' or @b='2']"?

> Setting the mapping as an attribute of a node will cause problems. It
means the nodes are mutable, so nodes can't be saved with the presumption
that the mapping is permanent. Specifically, a future match can override
the attribute definition from an earlier match.

> There are other possibilities than associating each matched node with
its corresponding XPath term. Regular expressions use a different solution
by using group numbers. Perhaps something similar could work here;
although I haven't thought about what that API might be.

> Working this out would require some examples of what you want to
achieve, since perhaps it's not needed at all, or it can be solved through
filter actions.

Note that I advocated, and Rocco seems to be satisfied with, replacing this
option with the approach where the user gets the ancestor nodes of the
matched subtree as well. They can then always use
{{{amara.xpath.util.abs_path(match_result_node)}}}, or some other tool, for
analysis.
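
For readers without Amara at hand, the idea of recovering an absolute path from a match result that still has its ancestors can be sketched with ElementTree. The `abs_path` function below is a hypothetical stand-in written for this example; its output format is an assumption and is not claimed to match what the Amara utility returns.

```python
import xml.etree.ElementTree as ET

def abs_path(root, node):
    """Return a positional absolute path such as /doc[1]/two[1]/a[2] for `node`."""
    # ElementTree nodes have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    while node is not None:
        parent = parent_of.get(node)
        if parent is None:
            index = 1
        else:
            same_name = [c for c in parent if c.tag == node.tag]
            index = next(i for i, c in enumerate(same_name, 1) if c is node)
        steps.append("%s[%d]" % (node.tag, index))
        node = parent
    return "/" + "/".join(reversed(steps))

root = ET.fromstring(
    "<doc><one><a>0</a><a>1</a></one><two><a>10</a><a>11</a></two></doc>"
)
matches = root.findall(".//a")
print(abs_path(root, matches[2]))
print(abs_path(root, matches[3]))
```

A positional path like this identifies the node unambiguously, which sidesteps the question of which of several equivalent patterns "really" matched.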



>> Attributes should be treated equally to elements, especially for the
purposes of rules. It’s a little silly (at least, as far as I can
understand) that I can associate rules with elements but not attributes.

> I'm going to infer you want something like the ability to have

> <a id="123" price='65.25'>

> and specify the id attribute should be saved as ".id" but the price
should be ignored? Would a filter rule action which maps XML attribute
name(s) to Python attribute names suffice? That is likely feasible.

That to me should be implemented in a rule on elements, not attributes.
Attributes have a special, very tightly coupled relationship to their
owner elements in the XML space, and this is reflected in the basic
pragmatism of how parser APIs are implemented. I would see this as
something like:

{{{
omit_attribute_rule(u'a', attrs=[u'price'])
}}}

Where you're free to e.g. replace u'a' with u'*'
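
What such a rule might do can be sketched against plain ElementTree nodes. `omit_attribute_rule` does not exist yet; the signature is taken from the call above and the body is purely an assumption about its behaviour.

```python
import xml.etree.ElementTree as ET

def omit_attribute_rule(name, attrs):
    """Return a function that strips `attrs` from elements named `name` ('*' = any)."""
    def apply(elem):
        if name == "*" or elem.tag == name:
            for attr in attrs:
                elem.attrib.pop(attr, None)
        return elem
    return apply

rule = omit_attribute_rule("a", attrs=["price"])
root = ET.fromstring('<doc><a id="123" price="65.25"/><b price="1"/></doc>')
for elem in root.iter():
    rule(elem)
remaining = [(e.tag, dict(e.attrib)) for e in root]
print(remaining)
```

The rule is keyed on the owner element and receives the attribute names as data, so no attribute-level matching is needed.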


>> It might be nice to add a Sax.START_ATTRIBUTE/END_ATTRIBUTE event to
pushbind’s parser in order to separate the element events from the
attribute events.

> I do not understand this. The SAX2 API does not have
START_ATTRIBUTE/END_ATTRIBUTE events, only the start and end of elements. Is
this something in Amara1 which I don't know enough about?

Well, considering he is advocating '''adding''' these, of course it's not
something Amara 1.x has ;)

Again I'm not at all likely to go for this, and I need use-cases to
illustrate alternate/better approaches.

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:6>

akar...@xml3k.org

Mar 15, 2010, 7:14:45 PM
to akar...@googlegroups.com
#48: Complete implementation of pushbind and related filter-based parser
capabilities

Comment(by https://www.google.com/accounts/o8/id?id=aitoawnfdtjjvsvjiec9huynk-adjuq1mgt39uc):

Replying to [comment:5 http://uche.myopenid.com/]:
> >>> Rocco
> >> Uche
> > Rocco
> Uche

First of all, my apologies for taking so long to reply. Last week was
pretty busy, and this is really the first day that I could peel away and
address these issues. Great questions, however! Thanks for helping vet
my ideas :-).

> > > > - Attributes should be treated equally to elements, especially for
the purposes of rules. It's a little silly (at least, as far as I can
understand) that I can associate rules with elements but not attributes.
> > >

> > > <snip>
> > >
> > > One thing that would help from you is use-cases. Could you add to
this ticket some use-cases of rules and/or pushbind examples of things
you've wanted to do with attributes, but were prevented in Amara 1.x?
> >
> > The biggest one that I can think of involves the matching stated
above. If I pass in ['/root/element@id', '/root/element/children'] to
pushbind, I then wanted to have a rule for each XPath that would associate
all matched fragments with the XPath that produced them. In other words,
if a node matched '/root/element@id', then that string would be added to
that document fragment before pushbind returned it.
>
> I don't see why you couldn't do that with an XPath expression within the
fragment context after the match. As long as you have the full, matched
subtree for each match result, as discussed above.

That is very much a possibility. There was unfortunately little
documentation available when I was writing my custom rules, and it's very
possible that I could do everything I have outlined above without any
changes to Amara, just with changes to my understanding :-). There is a
trade-off here between cleaner interface and increased documentation. I
will say that the current interface was clean enough (mostly) although
some simplification would make the product easier to use (more on that at
point four).


> The more I think of it, the more I don't see how it makes sense to
support matches for attribute nodes. Just starting with the
implementation basics, the parser sends attributes all at once, along with
the start element event. So the implementation would have to do some
inefficient slicing and dicing, just to support something for which I
can't see a use case.

I'm not certain that we should limit ourselves to what is easy to do with
the libraries powering Amara or even what is most efficient. It seems to
me that Amara really differentiates itself by its ease of use. Improving
the ease of use would only enhance the product. Also, from having peered
a little bit into Amara, it seems that what I am asking for is completely
reasonable; it just isn't all exposed in a way that is very easy to access
(or easy to understand for a newbie like myself :-).

I will say this in defense of matching attributes: if pushbind won't do it
for me (let's use your assumptions above), then how can I do this as an
end user? Could you guys provide tools to parse an XPath and provide a
new one that matches the attribute's parent element? This is particularly
important if the user has a higher-level attribute match that only filters
the grandparent elements. What about providing me with XPaths that match
only the attribute that I want, using that attribute's parent as a base?
I would need this information to match the attribute that I want from the
base element. How do we handle scenarios when the user wants an attribute
only when another attribute test passes on that same parent element? You
can see how throwing this back to the user quickly becomes overwhelming.


> If you do come up with a use-case, preferably with full, clear pseudo-
code, please let us know.

I guess I feel that my functional explanation in my past post had done
that. Would code postings of my past attempts and some explanations of
how they failed (in my very uneducated eyes) be more useful? Please keep
in mind that what is driving my need is a functional user requirement: my
XPath queries are given to me by my users at run time. I cannot simply be
smart about how I parse the XML because I have no a priori knowledge.
Please also keep in mind that I also do not want to parse XPaths myself
for reasons given above.

I have quite a few to post so if you have any suggestions of how to
present them, I would welcome them.


> > One thing I forgot to add: it would also be great if, in Amara 2, the
user could add custom pushbind rules without needing to know pushbind's
internals intimately ;-). Here are some examples of what I think would be
useful:
>
> The user will need to understand the parser/builder model in order to
write their own filters. I don't think we can get around that.

I'll respond to this with point 4 below.


> > 3. Refactor the event model such that the user doesn't have to add,
manage, and remove three different rules in order to process the element.
An implementation that I think would be very clear would be to have rule
classes implement three callbacks, one for each stage of processing the
XML (i.e. start, end, and character).
>
> The filter will probably have access to the parse events within the
matching subtree, so yes, it would probably be a matter of parser event
call-backs. But there still has to be a way to say e.g. "don't build this
element because I've omitted it", etc. The user will need to learn the
conventions for doing that. There is no way to avoid new conventions for
that.

It sounds to me like we are in agreement here; if we are not, please speak
up. Since it seems to me like only the developers of Amara know this
interface, I'm not too worried about changing these conventions :-).
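
To make the three-callback idea concrete, here is a rough sketch built on the standard library's xml.sax rather than on Amara. The names Rule, CollectText and Dispatcher are hypothetical and exist only for illustration; rules are keyed on a plain element name instead of an XPath to keep the example short.

```python
import xml.sax


class Rule(object):
    """Hypothetical rule base: one callback per stage of processing."""

    def __init__(self, name):
        self.name = name  # element name the rule applies to (no XPath here)

    def start(self, name, attrs):
        pass

    def chars(self, text):
        pass

    def end(self, name):
        pass


class CollectText(Rule):
    """Example rule: gather the text content of each matching element."""

    def __init__(self, name):
        Rule.__init__(self, name)
        self.results = []
        self._buf = []

    def start(self, name, attrs):
        self._buf = []

    def chars(self, text):
        self._buf.append(text)

    def end(self, name):
        self.results.append(''.join(self._buf))


class Dispatcher(xml.sax.ContentHandler):
    """Routes SAX events to the rules whose element is currently open."""

    def __init__(self, rules):
        xml.sax.ContentHandler.__init__(self)
        self.rules = rules
        self.open = []  # one list of matching rules per open element

    def startElement(self, name, attrs):
        matched = [r for r in self.rules if r.name == name]
        self.open.append(matched)
        for rule in matched:
            rule.start(name, dict(attrs.items()))

    def characters(self, content):
        for matched in self.open:
            for rule in matched:
                rule.chars(content)

    def endElement(self, name):
        for rule in self.open.pop():
            rule.end(name)


rule = CollectText('id')
xml.sax.parseString(
    b'<doc><basic><id>1</id></basic><basic><id>2</id></basic></doc>',
    Dispatcher([rule]))
print(rule.results)
```

The rule writer only ever sees the three callbacks; registration and removal of per-event handlers stays inside the dispatcher.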


> > 4. Make it easy for the user to override pushbind and to pass through
to pushbind depending upon whether they want to override or to augment
pushbind's functionality. I think explicit calls here would be most
useful. For example, in order to implement element_skeleton_rule, the
user would explicitly call pushbind in the start and end methods in order
to ensure that the proper XML skeleton nodes are created and wired.
However, in the char method, the user would explicitly not call pushbind
and would simply provide "blank" values in order to ensure that the XML
bodies are not filled in.
>
> There is a trade-off here between having one filter have to rely on
another filter, and choreographing state between two filters. We're
probably going to go for the latter approach. Therefore an
element_skeleton_rule filter would not explicitly call pushbind, but would
rather set a state that allows the overall chain of filters to trickle all
the way down to pushbind.
>
> Otherwise you end up with the problem of filter dependency management
and all that. Yuck!

This is a very good point, and I can see how a dependency-based
organization scheme could become a mess. Given that, I think that it is
still valuable to reduce clutter and provide the rule writer with objects
at an appropriate level of detail. As a naive developer who learned all
he knows by reading the code, there seemed to be a lot of stuff that I had
to learn in order to write a really simple rule. The binder object,
binder.event_completely_handled, the fact that the apply function is a
rule that actually adds three other rules in order to parse the XML
element itself (why do I have to do this for the same saxtools event that
the rule itself is written for?), and the fact that XML attributes are
accessible only via their parent element rather than as objects in their
own right, as pushbind provides them, are all implementation details that
I don't need to know if I simply want to mutate a single element or
attribute.
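
As I understand the state-passing approach described in the quoted reply, it might look roughly like the sketch below. Everything here is hypothetical (SkeletonFilter, TreeBuilderFilter, run_chain and the 'suppress_text' key are invented for illustration); it is not Amara's filter API.

```python
class SkeletonFilter(object):
    """Hypothetical filter: asks downstream stages to drop character data."""

    def handle(self, event, payload, state):
        if event == 'chars':
            # Set shared state instead of calling the builder directly.
            state['suppress_text'] = True


class TreeBuilderFilter(object):
    """Stand-in for the final, pushbind-like stage that builds output."""

    def __init__(self):
        self.output = []

    def handle(self, event, payload, state):
        if event == 'chars' and state.get('suppress_text'):
            payload = ''  # keep the skeleton, blank out the body
        self.output.append((event, payload))


def run_chain(filters, events):
    """Pass each event through every filter, sharing one state dict."""
    for event, payload in events:
        state = {}  # fresh state for each event
        for f in filters:
            f.handle(event, payload, state)


builder = TreeBuilderFilter()
run_chain([SkeletonFilter(), builder],
          [('start', 'id'), ('chars', '1'), ('end', 'id')])
print(builder.output)
```

Neither filter holds a reference to the other, so there is no dependency to manage; the only shared convention is the meaning of the state keys.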

In addition, getting some of this information out of Amara was also really
hard (if I have an element object, how do I get the xpattern_attr_wrapper?
Through xml_attributes? _attributes? xpathAttributes? And what are the
interfaces for these functions (or dictionaries)? What if the namespace
and name of the attribute are provided by the user at run-time? Can Amara
parse those out of the XPath for me? If I modify one of these
xpattern_attr_wrapper objects on a parent element, why is the object
passed out of Amara not modified? And, finally, why do I have to refer to
everything via the namespace and name of the attribute that I want? Why
can't I just provide the attribute name?).

Now, before anyone interprets the above as a rant, I want to be very clear
that I am EXTREMELY NAIVE about Amara and 4Suite, and that I didn't get
much help from the community because we are all focusing on 2.0. However,
I do feel that those comments elucidate the user experience of writing a
custom rule with the current level of documentation. As a naive user, I
hope that my experiences help make this product more open for all people
to use and easier to expand (potentially for you guys as well :-).

What would have been ideal for me would have been to say, "Pass me the
bindery and xpattern_attr_wrapper objects that match this XPath after you
have done processing them but before you give them to the user." I could
then modify them in a function that I wrote myself, and then pass that
object back away. It was rather daunting to dig into all these other
ideas just to implement a rule like that.

I will also state that a little bit of documentation here could have gone
a long way and might in fact be the better solution, assuming there are
significant technical benefits to the current rules system. I would be
more than willing to help beta test some documentation if that's what we
perceive as the best way to go forward.


> > 5. Allow the user direct access to the object that will become/is/was
the processed node and allow the user both to mutate the object and to
replace it with an object of another type if so desired. In other words,
for the start event, give the user access to the skeleton object that may
become the final output. For end, pass in the final product. Passing in
the binder object is silly (at least, in my experience) because the binder
provides too much access: when I write a rule, I really just want to deal
with the object that the rule matches. Perhaps its children are
interesting as well, but I'm not so sure.
>
> Again this whole thing is switching to a SAX-like model, so, for
example, a filter will not have access to children of its current context
upon invocation. The filter will just have all the access to normal
parser events.

I again think that we are in agreement here. I'll just add a nuance--that
you probably already understand--that my goal here is not to remove
functionality but to simply repackage it so that the user does not have to
understand the binder element, the binder.event_completely_handled
variable, rule.priority, and a few other internals, or need to add rules
to implement rules. It would be great if we could simply pass around
objects akin to the final product with as many blanks filled in as
possible. I'm not saying to change the event model; I'm just saying to
make it easier to access for external developers.


> > 6. No longer expose priority to the user. I think that refactoring
the events as described above will remove most of the need for an
externally available priority.
>
> If you don't have some sort of priority, you rely on the user to order
the filters correctly. It's a tricky matter of defaults versus warnings
versus just documentation. I think only use-cases will determine how to
resolve that.

This is best answered by my response to point four.

One final language question: I have been using the word "rules" because
that's what's in the Amara documentation, but I see that you keep using
the word "filter". Is there a nuance that I am ignorant of, or are these
simply synonyms?

Andrew, your points are also great, but I will have to reply to them
tomorrow. It's a little late for me here.

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:7>

akar...@xml3k.org

Mar 18, 2010, 2:56:24 PM3/18/10
to akar...@googlegroups.com
#48: Complete implementation of pushbind and related filter-based parser
capabilities
----------------------------------------+-----------------------------------
Reporter: http://uche.myopenid.com/ | Owner: David Beazley <dave@…>
Type: defect | Status: new
Priority: major | Milestone: safe harbor
Component: amara | Version: 2.0
Keywords: parser, xml, expat, filter |
----------------------------------------+-----------------------------------

Comment(by https://www.google.com/accounts/o8/id?id=aitoawnfdtjjvsvjiec9huynk-adjuq1mgt39uc):

Replying to [comment:6 http://uche.myopenid.com/]:

>>> Rocco
>> Andrew Dalke (on list)

> Uche

Dear Andrew,

Thanks for the questions. Sorry that my answers are so slow, but we had
another "all-hands" moment in the office over the last two days. I'll go
through these one by one.


>>> It would be great if Amara automatically provided the XPath that
matched the node as an attribute on all returned nodes. That’s really
what has started my own investigation into pushbind.
>>
>> Could you provide an example of how you would want to use that? I ask
because some things are ambiguous:
>>
>> XPath: a//b given <a><a><b/></a></a>
>>
>> Does it matter which "<a>" is matched?

Nope. In my case, I just want to match a list of XPaths given by the user
with the nodes that Amara returns. Any and all pattern matches are used
to identify which section of user code gets executed next. In other
words, whether I get a/a/b or just a/b is unimportant. However, that a//b
matched and a//c did not match is important.


>> Xpath: (*[@name]|a|b) given <a name="blah"/>
>>
>> Does it matter which XPath node is matched?

Again, nope. I just want to know that the given XML node matches a given
XPath. If I (really, my users) need to differentiate, they could do it
via other means (i.e. providing three XPaths or providing user code to
differentiate).


>> I would also like to know what information you want from the XPath
term? The relevant XPath substring? The equivalent transformed string?
(Eg, if there is some normalization so that (a[@b]|a) gets converted to a
simple "a".) Or do you want the parsed XPath grammar node?

Hmm. Interesting question. For my particular use case, I really just
want a way to pass in the user's XPath strings one by one and see which
one matches the current document fragment. In other words, if pushbind
returns XML fragments that match both '/root/elem1' and
'/root/elem2/elem1', I want to be able to say
frag.doesMatch('/root/elem1') or frag.doesMatch('/root/elem2/elem1').
Same with attributes--in fact, attribute matches are equally important
since we do have some memory performance use cases where attribute
processing is critical.

For now, just matching the XPath strings against the returned document
fragments suffices. That approach avoids lots of hairy string
normalization issues. However, if there is a way to turn an XPath into a
Python object that can be compared for equality, then we might have a few
different options. If I could create a list of these objects from their
XPath strings, I could then just check the attribute on the returned
fragment for equality. As long as the fragment knows not to serialize its
XPath attribute (because it only reports where the XML was found by
pushbind, not where it might have mutated to) then we should be all set.
This might even provide future Amara users with a lot more power since
they might have access to each XPath term (i.e. the namespace of the
query, the parent element of the match, attribute filters on the
grandparent of the match, etc.) and therefore avoid repeating this
conversation :-).

An important wrinkle here is that if my user gives me two overlapping
XPaths (let's say /root/elem[attr='5']/elem1 and /root/elem/elem1), I want
fragments that match the first pattern to match *both* patterns when I use
the doesMatch function. For fragments that only match the second one,
only the second should match. This requirement is important to me because
my users get to match functionality to XPaths and such a differentiation
allows them to "augment" the processing of very specific nodes in an easy
to state manner.
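
A minimal sketch of the matching behavior I have in mind follows. The does_match and parse_pattern functions are hypothetical, the "context" is just a list of (element name, attributes) pairs from the root down to the fragment, and only simple absolute paths with an optional [@attr='value'] test per step are handled (I am assuming the predicate is an attribute test, written with "@" here).

```python
import re

# One step: a name, optionally followed by a single [@attr='value'] test.
STEP = re.compile(
    r"^(?P<name>[\w.-]+)(?:\[@(?P<attr>[\w.-]+)='(?P<value>[^'/]*)'\])?$")


def parse_pattern(pattern):
    """Split a path such as /root/elem[@attr='5']/elem1 into steps."""
    steps = []
    for raw in pattern.strip('/').split('/'):
        m = STEP.match(raw)
        if m is None:
            raise ValueError('unsupported step: %r' % raw)
        steps.append((m.group('name'), m.group('attr'), m.group('value')))
    return steps


def does_match(context, pattern):
    """True if every step of the pattern agrees with the context."""
    steps = parse_pattern(pattern)
    if len(steps) != len(context):
        return False
    for (name, attr, value), (ctx_name, ctx_attrs) in zip(steps, context):
        if name != ctx_name:
            return False
        if attr is not None and ctx_attrs.get(attr) != value:
            return False
    return True


patterns = ["/root/elem[@attr='5']/elem1", "/root/elem/elem1"]
specific = [('root', {}), ('elem', {'attr': '5'}), ('elem1', {})]
general = [('root', {}), ('elem', {'attr': '7'}), ('elem1', {})]

matches_specific = [p for p in patterns if does_match(specific, p)]
matches_general = [p for p in patterns if does_match(general, p)]
print(matches_specific)
print(matches_general)
```

The fragment under the elem with attr='5' matches both patterns, while the other fragment matches only the general one, which is the "augmenting" behavior described above.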

In terms of having access to an XPath grammar node, I am not familiar
enough to know if I want that or not. Perhaps this is akin to my "Python
XPath object" above :-)?


>> Do you want the specific attribute test which passed, in the case of
"*[@a='1' or @b='2']"?

Again, it's only important to me that I can match up the XML node to the
XPath that caused pushbind to output it. However, if pushbind were to
return an attribute, I should be able to match that attribute in the
*exact* same manner as I match the elements.


>>> Attributes should be treated equally to elements, especially for the
purposes of rules. It’s a little silly (at least, as far as I can
understand) that I can associate rules with elements but not attributes.

>> I'm going to infer you want something like the ability to have
>>
>> <a id="123" price='65.25'>
>>
>> and specify the id attribute should be saved as ".id" but the price
should be ignored? Would a filter rule action which maps XML attribute
name(s) to Python attribute names suffice? That is likely feasible.
>That to me should be implemented in a rule on elements, not attributes.
Attributes have a special, very tightly coupled relationship to their
owner elements in the XML space, and this is reflected in the basic
pragmatism of how parser APIs are implemented. I would see this as
something like:
>
>{{{
>omit_attribute_rule(u'a', attrs=[u'price'])
>}}}
>
>Where you're free to e.g. replace u'a' with u'*'

What I would be interested in doing in this situation is asking pushbind
for a/@id and then having a rule execute on a/@id (in this case, to add
the users' XPath to the id attribute object). So, it's not that I am
trying to ignore price as it is that I am only interested in id. Changing
attribute matches to be matches on their parent elements is right now a
non-starter because I would have to parse the XPath very heavily by hand
in order to separate the attribute portion from the element portion and
then augment the attributes as they come down the pike. I really couldn't
figure out how to get Amara to do that. I'll speak more about that in my
use cases (coming next!) since there's quite a few of them.

This does bring up an interesting question: could we translate searches
for '/root/element/@attr' into a search for '/root/element' and then
ignore all of '/root/element's children? If we were to go down that path,
is there any way to search for '/root/element<rule: ignore children>' and
'/root/element/subelement' in the same pushbind statement? Or does the
ignore rule automatically assume that we will never be processing the
child elements?
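
For the simplest cases, the translation could be sketched as below. Both function names are hypothetical, and the split only looks at a trailing "/@name" step, so it would be wrong for patterns whose predicates themselves contain "/@"; a real solution would need Amara's own XPath parser.

```python
def split_attribute_pattern(pattern):
    """Split '/root/element/@attr' into ('/root/element', 'attr').

    Patterns that do not end in an attribute step come back unchanged,
    paired with None.
    """
    head, sep, tail = pattern.rpartition('/')
    if sep and head and tail.startswith('@'):
        return head, tail[1:]
    return pattern, None


def group_by_element(patterns):
    """Map each element pattern to the attribute names requested on it.

    A None entry means the element itself was requested.
    """
    grouped = {}
    for pattern in patterns:
        element_pattern, attr = split_attribute_pattern(pattern)
        grouped.setdefault(element_pattern, []).append(attr)
    return grouped


grouped = group_by_element(['/root/element/@attr',
                            '/root/element',
                            '/root/element/subelement'])
print(grouped)
```

Grouping this way at least shows which element patterns would need both "attributes only" and "full subtree" handling in the same call.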

Thanks, Andrew, for your questions. Use cases next, in one BIG post!

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:8>

akar...@xml3k.org

Mar 18, 2010, 6:10:39 PM3/18/10
to akar...@googlegroups.com
#48: Complete implementation of pushbind and related filter-based parser
capabilities
----------------------------------------+-----------------------------------
Reporter: http://uche.myopenid.com/ | Owner: David Beazley <dave@…>
Type: defect | Status: new
Priority: major | Milestone: safe harbor
Component: amara | Version: 2.0
Keywords: parser, xml, expat, filter |
----------------------------------------+-----------------------------------

Comment(by https://www.google.com/accounts/o8/id?id=aitoawnfdtjjvsvjiec9huynk-adjuq1mgt39uc):

So, here are all of my rule use cases, in the order in which they were
tried. My intention when I was creating these was to exploit the fact
that rules are applied to the XPaths they match; therefore, I would pass
in a single XPath to each instance of my rule and then have the rule
append that XPath to the object that it matched. That way, my upstream
code could match the document fragments with the XPath that identified
them and pick the right segment of user code to run based upon the match.
I basically hopped from idea to idea until I plain and simply ran out of
things to try. This all represents about a week of work as I spent a lot
of time learning how Amara works while I did this.

First, my test XML is the following:

{{{
#!xml
<doc>
  <data event="1">
    <headerData>...</headerData>
  </data>
  <data event="2">
    <headerData>...</headerData>
    <basic>
      <id>1</id> ...
    </basic>
    <payload>...</payload>
    <footerData>...</footerData>
  </data>
  <data event="3">
    <headerData>...</headerData>
    <basic>
      <id>1</id> ...
    </basic>
  </data>
  <data event="4">
    <headerData>...</headerData>
    <basic>
      <id>1</id> ...
    </basic>
    <payload>...</payload>
    <footerData>...</footerData>
  </data>
  <data event="5">
    <headerData>...</headerData>
    <basic>
      <id>1</id> ...
    </basic>
    <payload>...</payload>
    <footerData>...</footerData>
  </data>
  <data event="6">
    <headerData>...</headerData>
    <basic>
      <id>1</id> ...
    </basic>
  </data>
</doc>
}}}

The ellipses represent large amounts of test XML data.

Second, here is my test harness:

{{{
#!python
import amara
import amara.bindery
import amara.binderytools
import amara.saxtools
import Ft

patterns = ['/doc/data/@event',
            '/doc/data/basic']

rules = [add_xpath(x) for x in patterns]

for frag in amara.pushbind('BW-CDR-20080124161500-2-8455830d-015781.xml',
                           patterns,
                           rules=rules):
    if isinstance(frag, amara.binderyxpath.xpath_attr_wrapper):
        print "%s: %s" % (frag.nodeName, frag.nodeValue)
    else:
        print "ID: ", frag.id

Both the test data and the test harness were invariant for all these
use cases.

So, here's my first use case:

{{{
#!python
class add_element_xpath(amara.binderytools.xpattern_rule_base):
    event_type = amara.saxtools.END_ELEMENT
    priority = 10

    def apply(self, binder):
        if not self.match(binder):
            return
        # I don't want to make this check because I *always* want to
        # modify the created node.
        # if binder.event_completely_handled:
        #     #Then another rule has already created an instance
        #     return

        # Add the XPath information!
        binder.binding_stack[amara.bindery.TOP].xml_node_xpatterns = self.xpatterns

        return
}}}

This worked perfectly for elements; in fact, I even ended up using this
later. However, it failed on attributes with the following error:

{{{
Traceback (most recent call last):
  File "parser.py", line 43, in ?
    rules=rules):
  File "C:\Program Files (x86)\Python-2.4.4\lib\site-packages\amara\binderytools.py", line 737, in pushbind
    return parser.parse(source)
  File "C:\Program Files (x86)\Python-2.4.4\lib\site-packages\amara\binderytools.py", line 838, in startElementNS
    self._chunk_consumer(attr)
  File "C:\Program Files (x86)\Python-2.4.4\lib\site-packages\amara\binderytools.py", line 715, in handle_chunk
    parser.setProperty(Sax.PROPERTY_YIELD_RESULT, node)
SystemError: parser suspended
}}}

There was no way around this error. I tried reordering the attributes,
reordering the XML (so it hit an element match before any attribute
matches), and executing with more or fewer XPaths, all to no avail. In
order to move forward, I made the following change to the
handle_chunk(node) function located in the binderytools.py file at line
717:

{{{
#!python
def handle_chunk(node):
>>> if not parser.getProperty(Sax.PROPERTY_YIELD_RESULT):  # Added this line!!!
        parser.setProperty(Sax.PROPERTY_YIELD_RESULT, node)
}}}

That at least avoided the crash and allowed me to try a few other things.

My second attempt was the following (changed lines prefixed by ">>>"):

{{{
#!python
class add_element_xpath(amara.binderytools.xpattern_rule_base):
>>> event_type = amara.saxtools.START_ELEMENT
    priority = 10

    def apply(self, binder):
        if not self.match(binder):
            return
        # Skipping self.event_completely_handled check

        # Add the XPath information!
        binder.binding_stack[amara.bindery.TOP].xml_node_xpatterns = self.xpatterns

        return
}}}

I basically changed the event_type from END_ELEMENT to START_ELEMENT since
that point is where Amara instantiates the attribute objects and attaches
them to the node before calling self.apply_rules() again. When I ran this
rule, it would never match (because attributes are never given to rules to
match them against in the first place).
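
The underlying parser behavior is easy to see with Python's standard xml.parsers.expat bindings (not Amara's own expat wrapper, which I am assuming behaves the same way in this respect): all of an element's attributes arrive together inside the start-element callback, and there is no separate per-attribute event that a rule could be attached to.

```python
import xml.parsers.expat

events = []


def start_element(name, attrs):
    # attrs is a dict holding every attribute of the element at once.
    events.append(('start', name, sorted(attrs.items())))


def end_element(name):
    events.append(('end', name))


parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element
parser.Parse('<doc><data event="1" kind="x"/></doc>', True)

for event in events:
    print(event)
```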

In my third attempt, I tried a number of things based upon how I best
thought I could work with the startElementNS function of the chunk_handler
class of binderytools.py. The goal in this attempt was to split the
attribute-matching XPaths into two parts: the first part matches the
element that contains the attribute, and the second matches just the
attribute itself (assuming that the base of the search is the parent
element). The rule would match the first XPath and then use the second
XPath on the matched element to find and modify the appropriate attribute.
Here is what I wrote to do that:

{{{
#!python
class add_xpath(amara.binderytools.xpattern_rule_base):
    event_type = amara.saxtools.START_ELEMENT
    priority = 10

    def apply(self, binder):
        if not self.match(binder):
            return
        # Skipping self.event_completely_handled check

        if <magic to determine if we are targeting an element or attribute>:
            # Element case
            binder.binding_stack[amara.bindery.TOP].xml_brain_xpatterns = self.xpatterns
        else:
            # Attribute case: using the element search to really find an
            # attribute under that element.
            binder.binding_stack[amara.bindery.TOP].attributes[<attribute tuple-tuple J>].xml_brain_xpatterns = self.xpatterns
        return
}}}

I had a few difficulties here. One was that I couldn't figure out whether
attributes or xpath_attributes was the more appropriate attribute to use
on the fragment. The second was the difficulty of identifying whether an
XPath is targeting an element or an attribute because of the sophisticated
parsing involved. This is particularly frustrating because Amara already
does a lot of this and because it just doesn't feel quite right. Thirdly,
the attributes dictionary interface is very difficult to use when you are
only given an XPath: how do you pull out the namespace of the attribute
and the element from the XPath (and what is the difference between the
namespace in the tuple and the namespace outside of it)? Finally, even
when I jerry-rigged the rule to pull out only one specific attribute, the
changes that I made to the object were never reflected in the fragment
that pushbind would return. I even once tried modifying Amara to attach a
reference to that object directly to the binder object: no dice. All in
all, I just couldn't get it to work, even as a prototype.

Finally, after spending more time in startElementNS, I tried changing the
event_type to START_DOCUMENT in order to see if that would work since
apply_rules is called right after appending all of the xpath_attr_class
objects to the current element (changed lines prefixed by ">>>"):

{{{
#!python
class add_element_xpath(amara.binderytools.xpattern_rule_base):
>>> event_type = amara.saxtools.START_DOCUMENT
    priority = 10

    def apply(self, binder):
        if not self.match(binder):
            return
        # Skipping self.event_completely_handled check

        # Add the XPath information!
        binder.binding_stack[amara.bindery.TOP].xml_node_xpatterns = self.xpatterns

        return
}}}

This did nothing as apparently the binder object is completely cleared
after every START_ELEMENT event.

In conclusion, the root difficulty is that even though attribute objects
are created for attributes that match XPaths, they are not easily
accessible to the rule writer. Unlike what it does with elements, Amara
appends these attributes to their parent element's object rather than
handing them directly to the rule itself. There are quite a few places where we are
really close to providing the rule writer with all the right data, but I
just couldn't quite figure out how to bridge that gap. This could
possibly be accomplished by exposing a few XPath interfaces, changing the
rule interface a little, or maybe even just through some tweaking of the
XML node objects themselves, or perhaps there is a completely different
approach that would make us both happy. I'm more than open to suggestions
since you guys know the product much better than I :-).

Please tell me if there is anything that I can do to help you guys
understand these use cases more or to understand better my perspective and
experiences.

Thank you,

Rocco

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:9>

akar...@xml3k.org

Mar 22, 2010, 6:10:23 PM3/22/10
to akar...@googlegroups.com
#48: Complete implementation of pushtree and related filter-based parser
capabilities
----------------------------------------+-----------------------------------
Reporter: http://uche.myopenid.com/ | Owner: David Beazley <dave@…>
Type: defect | Status: new
Priority: major | Milestone: safe harbor
Component: amara | Version: 2.0
Keywords: parser, xml, expat, filter |
----------------------------------------+-----------------------------------

Old description:
> In addressing this issue the relevant code needs much better
> documentation, and possibly other improvements. See related, catch-all
> issue #22

New description:

Amara was designed to enable a variety of facilities in streaming mode,
for performance. This requires some gymnastics dealing with the
[http://expat.sourceforge.net/ expat] parser (see the unfortunately huge
module http://bitbucket.org/uche/amara/src/tip/lib/src/expat/expat.c ) in
order to suspend and resume it to allow Amara to perform concurrent
processing.

In general, this processing involves a
[http://wiki.xml3k.org/Amara2/Architecture/Streaming_XPath streamable
XPath subset], so that, for example, a user could register a pattern
"/a/b[attr='1']" which would process each b child of the root a element if
that child had an attribute with name "attr" and value "1".

The most generally useful instantiation of this idea is pushbind (which
will be renamed pushtree in 2.x), where Amara yields small document
subtrees according to declared patterns. See the
[http://wiki.xml3k.org/Amara/Manual#pushbind documentation of Pushbind for
Amara 1]. Pushtree is not yet implemented for Amara 2, and doing so should
be a key outcome of this ticket.

Amara 1.x was not closely integrated with the expat C parser, so some
pretty inefficient threading was required to send back the subtrees once
ready.

In Amara 2.x the low level expat handler now has the wherewithal to
suspend and resume at set boundaries of the input document, but this is
not yet hooked up to the node building machinery, and also does not
implement the state machine required for the XPath streamable subset
patterns.

Note: The main difference between Pushbind/Pushtree and ElementTree
iterparse is that the latter primarily sends events, and gives you the
hooks to build subtrees. Pushtree instead triggers on document patterns
to automate the subtree building for the user which is easier to use, and
covers most real cases I've seen.

In addressing this issue the relevant code needs much better
documentation, and possibly other improvements. See related, catch-all
issue #22

See also: http://wiki.xml3k.org/Akara/Dev/Amara%202.x%20Pushtree%20plus

--

Comment(by http://uche.myopenid.com/):

Rocco,

I do appreciate all that work, but I think you misunderstood what we need
right now for use-cases :)

Amara 2.x is a huge change, and I think it might confuse things a bit to
express use-cases purely in 1.x terms.

I think I thought up an example where attribute patterns make sense. I've
written it up here:

http://wiki.xml3k.org/Akara/Dev/Amara%202.x%20Pushtree%20plus#Whentheuseronlycaresaboutattributes

Notice how simply I expressed it: Sample XML, a bit of pseudocode (i.e.
guessing the API) and a brief, prose description of behavior. Please let
me know whether that captures the use of attribute matching well enough
for you.

Also, if you have any other use-cases you can express similarly, please
contribute them on this ticket, and we can discuss them for addition on
that page.

Thanks.

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:10>

akar...@xml3k.org

Mar 22, 2010, 6:13:38 PM3/22/10
to akar...@googlegroups.com
#48: Complete implementation of pushtree and related filter-based parser
capabilities
----------------------------------------+-----------------------------------
Reporter: http://uche.myopenid.com/ | Owner: David Beazley <dave@…>
Type: defect | Status: new
Priority: major | Milestone: safe harbor
Component: amara | Version: 2.0
Keywords: parser, xml, expat, filter |
----------------------------------------+-----------------------------------

Comment(by http://uche.myopenid.com/):

BTW, Rocco, if you'd like to help, I could really use a hand porting test
cases from Amara 1.x Pushbind. Could you help? If so, it would probably
be in a couple of weeks or so, once we have a basic handle on the 2.x API
(i.e. after more description on this ticket and the wiki page).

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:11>

akar...@xml3k.org

Mar 23, 2010, 12:57:49 PM3/23/10
to akar...@googlegroups.com
#48: Complete implementation of pushtree and related filter-based parser
capabilities
----------------------------------------+-----------------------------------
Reporter: http://uche.myopenid.com/ | Owner: David Beazley <dave@…>
Type: defect | Status: new
Priority: major | Milestone: safe harbor
Component: amara | Version: 2.0
Keywords: parser, xml, expat, filter |
----------------------------------------+-----------------------------------

Comment(by https://www.google.com/accounts/o8/id?id=aitoawnfdtjjvsvjiec9huynk-adjuq1mgt39uc):

Ah, now I understand why we kept talking past each other. Now I feel a
little silly :-).

So, I have taken a look at the use cases that we have, and I have a few
comments.

First, "Keeping the ancestral context of matches" looks good. My only
comment is that we didn't explicitly call out the ability to say
a.matches('/doc/one/a') or something similar. This was the main benefit--
in my eyes--of this approach.

"When the user only cares about attributes" looks OK, but I think that I
might not have gotten my initial concern across completely. Let's try the
following use case. Let's imagine that we have the following XML file
that represents a security log for a server, organized by actor:

{{{
#!xml
<securityLog>
  <actor name="Bob">
    <readAction>
      <filename>...</filename>
      <timestamp>...</timestamp>
    </readAction>
    <writeAction>...</writeAction>
    <readAction><!-- Organized just like above. --></readAction>
    <executeAction>...</executeAction>
    . . . <!-- All three of these entry types are repeated for as many
               actions as this actor has. This list can become extremely
               large. -->
  </actor>
  <actor name="Francis">
    . . . <!-- Just like above. -->
  </actor>
  <!-- Billions of more actors. -->
</securityLog>
}}}

In this case, we want to output a simple table with three columns: the
name of the actor and the filename and timestamp of every readAction. In
this case, we cannot match the actor tag because it is too large.
Instead, we will match the name attribute and the readAction element:

{{{
#!python
name = None
for frag in amara.pushtree(['/securityLog/actor/@name',
                            '/securityLog/actor/readAction']):
    if frag.matches('/securityLog/actor/@name'):
        name = frag.nodeValue
    else:
        print name, frag.filename, frag.timestamp
}}}

You can easily imagine how this would generalize for more XPaths.
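
For comparison, here is roughly how the same table can be produced today with the standard library's ElementTree iterparse. This is only a sketch to show the behavior I am after, not a proposal for the pushtree API: the read_actions function and the sample data are made up, and the path tracking that pushtree would do from the patterns is written out by hand.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<securityLog>
  <actor name="Bob">
    <readAction><filename>a.txt</filename><timestamp>t1</timestamp></readAction>
    <writeAction>ignored</writeAction>
    <readAction><filename>b.txt</filename><timestamp>t2</timestamp></readAction>
  </actor>
  <actor name="Francis">
    <readAction><filename>c.txt</filename><timestamp>t3</timestamp></readAction>
  </actor>
</securityLog>"""


def read_actions(source):
    """Yield (actor name, filename, timestamp) without keeping whole actors."""
    name = None
    stack = []  # currently open elements, root first
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            stack.append(elem)
            if [e.tag for e in stack] == ['securityLog', 'actor']:
                # Attributes are already available on the start event.
                name = elem.get('name')
        else:
            path = [e.tag for e in stack]
            if path == ['securityLog', 'actor', 'readAction']:
                yield name, elem.findtext('filename'), elem.findtext('timestamp')
            stack.pop()
            if len(stack) in (1, 2):
                # Drop finished actors and their finished children so the
                # partially built tree stays small.
                stack[-1].remove(elem)


rows = list(read_actions(io.BytesIO(SAMPLE)))
for row in rows:
    print(row)
```

The attribute is picked up on the actor's start event, long before the actor element is complete, which is exactly why matching the whole actor element is not an option here.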

There are a few more variations on this theme that I will consider over
this week to see if we should include them too.

With regards to stripping out uninteresting data, that comment was driven
by my desire to find a workaround for dealing with the rules engine.
Given that, I'm not certain if you guys want to keep it in the use case or
not. It is a nice-to-have, but it's most certainly not critical to me.

As for the test cases, what is your schedule? When do you anticipate
needing someone to start on them? And when you say "a couple of weeks",
what kind of schedule are you looking at? Three weeks part time? Three
weeks full time? Three weeks at 2 hours a day? I'm just trying to get
an idea of what I am volunteering for to ensure that I can deliver :-).

Take care,

Rocco

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:12>

akar...@xml3k.org

Mar 29, 2010, 12:54:51 PM
to akar...@googlegroups.com
#48: Complete implementation of pushtree and related filter-based parser
capabilities
----------------------------------------+-----------------------------------
Reporter: http://uche.myopenid.com/ | Owner: David Beazley <dave@…>
Type: defect | Status: new
Priority: major | Milestone: safe harbor (i.e. 2.0 final)
Component: amara | Version: 2.0
Keywords: parser, xml, expat, filter |
----------------------------------------+-----------------------------------

Comment(by https://www.google.com/accounts/o8/id?id=aitoawnfdtjjvsvjiec9huynk-adjuq1mgt39uc):

I thought over the variants and didn't really find anything different
enough from the above example to really warrant a separate use case. The
exclusion example came close (i.e. the amara.tree.strip language found
near the end of
http://wiki.xml3k.org/Akara/Dev/Amara%202.x%20Pushtree%20plus#Whentheuseronlycaresaboutattributes)
simply because there is no good way to create a "match all elements under
/doc/tag not named <x>" XPath query from '/doc/tag/x'. However, I
couldn't really find a good example that captured that succinctly.
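
To make the exclusion case concrete: on a fully built tree, full XPath 1.0 can express the complement as `/doc/tag/*[not(self::x)]`, but that relies on the `self` axis and `not()`, and whether a streamable subset would accept it is an open assumption here. The sketch below performs the same selection with the standard library's ElementTree and a plain Python filter. The sample document is made up.

```python
import xml.etree.ElementTree as ET

# Made-up document: <tag> holds the excluded <x> elements plus others.
doc = ET.fromstring("<doc><tag><x/><y/><z/><x/></tag></doc>")

# ElementTree's limited XPath cannot say "not named x", so select every
# child of /doc/tag and filter on the tag name in Python.
kept = [child.tag for child in doc.findall("./tag/*") if child.tag != "x"]
print(kept)
```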

Hope that helps,

Rocco

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:13>

akar...@xml3k.org

Mar 30, 2010, 5:52:14 PM
to akar...@googlegroups.com
#48: Complete implementation of sendtree/itertree (former pushbind) and related
filter-based parser capabilities
----------------------------------------+-----------------------------------
Reporter: http://uche.myopenid.com/ | Owner: David Beazley <dave@…>
Type: defect | Status: new
Priority: major | Milestone: safe harbor (i.e. 2.0 final)
Component: amara | Version: 2.0
Keywords: parser, xml, expat, filter |
----------------------------------------+-----------------------------------

Old description:

> Amara was designed to enable a variety of facilities in streaming mode,
> for performance. This requires some gymnastics dealing with the
> [http://expat.sourceforge.net/ expat] parser (see the unfortunately huge
> module http://bitbucket.org/uche/amara/src/tip/lib/src/expat/expat.c ) in
> order to suspend and resume it to allow Amara perform concurrent
> processing.
>
> In general, this processing involves a
> [http://wiki.xml3k.org/Amara2/Architecture/Streaming_XPath streamable
> XPath subset]. so that for example a user could register a pattern
> "/a/b[attr='1']" which would process each b child of the root a element
> if that child had an attribute with name "attr" and value "1".
>
> The most generally useful instantiation of this idea is pushbind (which
> will be renamed pushtree in 2.x), where Amara yields small document
> subtrees according to declared patterns. See the
> [http://wiki.xml3k.org/Amara/Manual#pushbind documentation of Pushbind
> for Amara 1]. Pushtree is not yet implemented for Amara 2, and doing
> should be a key outcome of this ticket.
>
> Amara 1.x was not closely integration with the expat C parser, so some
> pretty inefficient threading was required to send back the subtrees once
> ready.
>
> In Amara 2.x the low level expat handler now has the wherewithal to
> suspend and resume at set boundaries of the input document, but this is
> not yet hooked up to the node building machinery, and also does not
> implement the state machine required for the XPath streamable subset
> patterns.
>
> Note: The main difference between Pushbind/Pushtree and ElementTree
> iterparse is that the latter primarily sends events, and gives you the
> hooks to build subtrees. Pushtree instead triggers on document patterns
> to automate the subtree building for the user which is easier to use, and
> covers most real cases I've seen.
>
> In addressing this issue the relevant code needs much better
> documentation, and possibly other improvements. See related, catch-all
> issue #22
>
> See also: http://wiki.xml3k.org/Akara/Dev/Amara%202.x%20Pushtree%20plus

New description:

Amara was designed to enable a variety of facilities in streaming mode,
for performance. This requires some gymnastics dealing with the
[http://expat.sourceforge.net/ expat] parser (see the unfortunately huge
module http://bitbucket.org/uche/amara/src/tip/lib/src/expat/expat.c ) in
order to suspend and resume it to allow Amara to perform concurrent
processing.

In general, this processing involves a
[http://wiki.xml3k.org/Amara2/Architecture/Streaming_XPath streamable
XPath subset], so that, for example, a user could register a pattern
"/a/b[@attr='1']" which would process each b child of the root a element if
that child had an attribute with name "attr" and value "1".
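
As a small illustration of what such a pattern selects, the following sketch evaluates the equivalent attribute predicate, written with the XPath attribute axis as `b[@attr='1']`, on a fully built tree using the standard library's ElementTree. It is not Amara code and does not stream; the sample document is made up.

```python
import xml.etree.ElementTree as ET

# Made-up document with three <b> children of the root <a>.
root = ET.fromstring('<a><b attr="1">first</b><b attr="2">second</b><b>third</b></a>')

# Relative to the root <a>, this selects each b child whose "attr"
# attribute has the value "1".
selected = [b.text for b in root.findall("b[@attr='1']")]
print(selected)
```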

The most generally useful instantiation of this idea is pushbind (which
will be renamed pushtree in 2.x), where Amara yields small document
subtrees according to declared patterns. See the
[http://wiki.xml3k.org/Amara/Manual#pushbind documentation of Pushbind for
Amara 1]. Pushtree is not yet implemented for Amara 2, and doing so should
be a key outcome of this ticket.

Amara 1.x was not closely integrated with the expat C parser, so some
pretty inefficient threading was required to send back the subtrees once
ready.

In Amara 2.x the low level expat handler now has the wherewithal to
suspend and resume at set boundaries of the input document, but this is
not yet hooked up to the node building machinery, and also does not
implement the state machine required for the XPath streamable subset
patterns.

Note: The main difference between Pushbind/Pushtree and ElementTree
iterparse is that the latter primarily sends events, and gives you the
hooks to build subtrees. Pushtree instead triggers on document patterns
to automate the subtree building for the user, which is easier to use and
covers most real cases I've seen.

In addressing this issue the relevant code needs much better
documentation, and possibly other improvements. See related, catch-all
issue #22

See also: http://wiki.xml3k.org/Akara/Dev/Amara%202.x%20Itertree%20plus

--

Comment(by http://uche.myopenid.com/):

First of all, a general note: DavidB and I discussed naming and
decided on a variant, `itertree`, which is basically like the previous
pushbind or the short-lived pushtree, plus a more explicit coroutine
form, `sendtree`.

As such, the main page has been renamed again, to:

http://wiki.xml3k.org/Akara/Dev/Amara%202.x%20Itertree%20plus

Rocco,

Sorry I'm just getting back into the swing after a bunch of travel. Your
use-case makes sense. I'll add that as an official case. Thanks.

Thanks for considering helping. I'd say a good time to start getting help
would be late April, maybe the week of the 19th or 26th.

I hadn't meant level of effort when I said "a couple of weeks". My guess
would be that it would take 4-8 hours total, and could easily be broken up
into chunks over a week or two. So let's just say for an example worst
case one week at 2 hours per day.

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:14>

akar...@xml3k.org

Apr 17, 2010, 9:19:16 PM
to akar...@googlegroups.com
#48: Complete implementation of sendtree/itertree (former pushbind) and related
filter-based parser capabilities
----------------------------------------+-----------------------------------
Reporter: http://uche.myopenid.com/ | Owner: David Beazley <dave@…>
Type: defect | Status: new
Priority: major | Milestone: safe harbor (i.e. 2.0 final)
Component: amara | Version: 2.0
Keywords: parser, xml, expat, filter |
----------------------------------------+-----------------------------------

Comment(by https://www.google.com/accounts/o8/id?id=aitoawmp5wr39rlphmmlfzinzws-p1j-o7tf_lm):

(Andrew Dalke wrote this. Ignore the long Google login id.)

I'm coming into this late with no experience with Amara. While I've read
the discussion here and elsewhere I'm still trying to make my way through
the differences of what was done, what's desired, and what can be done.

I'll start with the xpath expression used to define (since I don't know a
better term), "records." These are the items returned by iterating over
the input stream.

David: Attributes should be treated equally to elements, especially for
the purposes of rules. It's a little silly (at least, as far as I can
understand) that I can associate rules with elements but not attributes.

Uche: This has always been more a matter of implementation convenience
than a fundamental design decision. We do intend to try to keep attributes
first-class this time, if we can.

From what I can tell, there's been no decision on if Amara2 must support
something like itertree(input, "@href"), which returns an iterator over
all 'href' attributes as xpath_attr_wrapper instances, or if it's
acceptable to return only iterators over elements.

There are some conveniences to the latter. If we build on a SAX-style
interface then each startElement event can have either 0 or 1 matches. If
we support attributes as well then each startElement event can have
up to N+1 matches, where N is the number of attributes. This complicates
state keeping in the parser, which is why I would rather not have to
support it.
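
The counting argument can be checked with a small sketch built on the standard library's `xml.sax`. It assumes the broadest possible patterns (every element, and optionally every attribute) and records how many matches each startElement event would produce. The class and function names are made up for illustration.

```python
import xml.sax


class MatchCounter(xml.sax.ContentHandler):
    """Record the number of matches produced by each startElement event."""

    def __init__(self, match_attributes):
        super().__init__()
        self.match_attributes = match_attributes
        self.matches_per_start = []

    def startElement(self, name, attrs):
        matches = [("element", name)]
        if self.match_attributes:
            # One extra match per attribute on this element.
            matches.extend(("attribute", attr) for attr in attrs.getNames())
        self.matches_per_start.append(len(matches))


def count_matches(xml_bytes, match_attributes):
    handler = MatchCounter(match_attributes)
    xml.sax.parseString(xml_bytes, handler)
    return handler.matches_per_start


DOC = b'<a x="1" y="2"><b/></a>'
elements_only = count_matches(DOC, match_attributes=False)
with_attributes = count_matches(DOC, match_attributes=True)
print(elements_only, with_attributes)
```

With elements only, every event yields at most one match; once attributes count, the event for `<a>` yields three, which is the N+1 case.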

There's a minor point that using Python's SAX API is dependent on
dictionary order for the attributes, which is implementation defined. I
don't know if this variability can affect things. In any case, perhaps the
expat API does preserve order? We are building on the expat API, right?
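
On the ordering question, Python's own expat binding (`xml.parsers.expat`, which is not the same thing as Amara's C wrapper) can report attributes in document order when `ordered_attributes` is set. A minimal sketch with a made-up one-element document:

```python
import xml.parsers.expat

seen = []


def start_element(name, attrs):
    # With ordered_attributes set, attrs is a flat list of
    # [name, value, name, value, ...] in document order.
    seen.append((name, attrs))


parser = xml.parsers.expat.ParserCreate()
parser.ordered_attributes = True
parser.StartElementHandler = start_element
parser.Parse(b'<a z="1" b="2" m="3"/>', True)
print(seen)
```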

If attributes are important enough to handle as first-class items, does
the same hold for processing-instructions and comments?

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:15>
xml3k <www.xml3k.org>
XML3K is a loose collective of open-source projects focused on Python, XML, and RESTful design.

--
You received this message because you are subscribed to the Google Groups "Akara Developers" group.
To post to this group, send email to akar...@googlegroups.com.
To unsubscribe from this group, send email to akara-dev+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/akara-dev?hl=en.

akar...@xml3k.org

Apr 17, 2010, 11:47:23 PM
to akar...@googlegroups.com
#48: Complete implementation of sendtree/itertree (former pushbind) and related
filter-based parser capabilities
----------------------------------------+-----------------------------------
Reporter: http://uche.myopenid.com/ | Owner: David Beazley <dave@…>
Type: defect | Status: new
Priority: major | Milestone: safe harbor (i.e. 2.0 final)
Component: amara | Version: 2.0
Keywords: parser, xml, expat, filter |
----------------------------------------+-----------------------------------

Comment(by http://uche.myopenid.com/):

Andrew, as far as matching on attributes goes, that bit is already settled,
and the "official" use-cases reflect that. You have not made clear whether
you have a specific problem with what is illustrated in that use-case.

http://wiki.xml3k.org/Akara/Dev/Amara%202.x%20Itertree%20plus#Whentheuseronlycaresaboutattributes

And yes, processing-instructions and comments should be accessible using
their XPath node tests.
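
At the parser level both node kinds are reported as events, so a streaming matcher has something to hook the `processing-instruction()` and `comment()` node tests onto. The sketch below uses Python's standard `xml.parsers.expat` binding, not Amara's own wrapper, and a made-up document, simply to show the events arriving.

```python
import xml.parsers.expat

events = []

parser = xml.parsers.expat.ParserCreate()
# Called with the PI target and its data string.
parser.ProcessingInstructionHandler = (
    lambda target, data: events.append(("processing-instruction", target, data))
)
# Called with the text between "<!--" and "-->".
parser.CommentHandler = lambda data: events.append(("comment", data))

parser.Parse(b'<doc><?render mode="fast"?><!-- note --><a/></doc>', True)
print(events)
```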

BTW, as promised I've added a use-case you can take as a guide to the
necessary performance characteristics, and can work into a full-on
benchmark, if need be, for testing.

http://wiki.xml3k.org/Akara/Dev/Amara%202.x%20Itertree%20plus#TEIBibliographyextraction

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:16>

akar...@xml3k.org

Apr 18, 2010, 3:17:49 AM
to akar...@googlegroups.com
#48: Complete implementation of sendtree/itertree (former pushbind) and related
filter-based parser capabilities
----------------------------------------+-----------------------------------
Reporter: http://uche.myopenid.com/ | Owner: David Beazley <dave@…>
Type: defect | Status: new
Priority: major | Milestone: safe harbor (i.e. 2.0 final)
Component: amara | Version: 2.0
Keywords: parser, xml, expat, filter |
----------------------------------------+-----------------------------------

Comment(by https://www.google.com/accounts/o8/id?id=aitoawmp5wr39rlphmmlfzinzws-p1j-o7tf_lm):

I carefully read that #Whentheuseronlycaresaboutattributes clause and I
wrote my previous comment because as a spec it is ambiguous. It says there
are three solutions

1) read the "person" tag and in user-space extract the right data. But
that reads too much data, so

2) "This might seem to be a good case for an attribute match", with an
outline of how that would be implemented

3) use solution #1 along with a filter which tells the parser to ignore the
other data.

The use of the words "might ... perhaps ..." is ambiguous. It either means
"the #2 approach is a possibility you may consider using" or "while #2 is
abstractly possible, you should use the #3 approach instead."

I will now interpret it as a requirement.

Part of my confusion comes from my difficulty in judging the authority
level, as it were, of that wiki page. It's not structured as if it were
definitive, since it says things like "to be determined" and "probably want
to translate these into sendtree form." Is it therefore a working draft,
which I have to read in the context of this trac thread in order to know
how to interpret it?

-- Andrew

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:17>

akar...@xml3k.org

Apr 19, 2010, 10:37:12 AM
to akar...@googlegroups.com
#48: Complete implementation of sendtree/itertree (former pushbind) and related
filter-based parser capabilities
----------------------------------------+-----------------------------------
Reporter: http://uche.myopenid.com/ | Owner: David Beazley <dave@…>
Type: defect | Status: new
Priority: major | Milestone: safe harbor (i.e. 2.0 final)
Component: amara | Version: 2.0
Keywords: parser, xml, expat, filter |
----------------------------------------+-----------------------------------

Comment(by https://www.google.com/accounts/o8/id?id=aitoawnfdtjjvsvjiec9huynk-adjuq1mgt39uc):

Replying to [comment:17 https://www.google.com/accounts/o8/id?id=aitoawmp5wr39rlphmmlfzinzws-p1j-o7tf_lm]:
> 2) "This might seem to be a good case for an attribute match", with an
> outline of how that would be implemented
>
> 3) use solution #1 along with a filter which tell the parser to ignore
> the other data.
>
> The use of the words "might ... perhaps ..." is ambiguous. It either
> means "the #2 approach is a possibility you may consider using" or "while
> #2 is abstractly possible, you should use the #3 approach instead."

Since I was the one who asked for this use case, I'll weigh in with a
clarification. #2 is the '''real''' use case, in my eyes. #3 was
actually an aside that I made in a discussion of rules that made its way
into the use case. While I don't have any problems with implementing #3,
#2 is really where the real value is, in my mind.

Hope that helps,

Rocco

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:18>

akar...@xml3k.org

Apr 23, 2010, 12:39:24 PM
to akar...@googlegroups.com
#48: Complete implementation of sendtree/itertree (former pushbind) and related
filter-based parser capabilities
----------------------------------------+-----------------------------------
Reporter: http://uche.myopenid.com/ | Owner: David Beazley <dave@…>
Type: defect | Status: new
Priority: major | Milestone: safe harbor (i.e. 2.0 final)
Component: amara | Version: 2.0
Keywords: parser, xml, expat, filter |
----------------------------------------+-----------------------------------

Comment(by http://uche.myopenid.com/):

David and Andrew have completed a first, very preliminary pass at
implementing the basic pattern-based rule push-parse. The empirical
performance is quite impressive. I went ahead and pushed the code because
in my testing it breaks nothing that was working before.

Note: in the end we renamed sendtree to pushtree, since it has been
generalized to handle any sort of callback, rather than just coroutines.
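
The generalization is easy to picture: a coroutine's bound `send` method is just another callable, so a push interface that accepts any callback covers the coroutine case as well. The following sketch uses made-up names (`push`, `collector`) and strings in place of real subtrees; it does not reproduce the actual pushtree signature.

```python
def push(records, callback):
    """Deliver each record to `callback`, which can be any callable."""
    for record in records:
        callback(record)


def collector(sink):
    """Coroutine that appends everything sent to it onto `sink`."""
    while True:
        record = yield
        sink.append(record)


records = ["<a/>", "<b/>"]  # stand-ins for matched subtrees

# Plain function as the callback.
via_function = []
push(records, via_function.append)

# Coroutine as the callback: pass its bound send method.
via_coroutine = []
coro = collector(via_coroutine)
next(coro)  # advance to the first yield so the coroutine can receive values
push(records, coro.send)
coro.close()

print(via_function, via_coroutine)
```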

In the following section, 2 out of the 3 code examples work in current
trunk, and are good ways to explore the API, limitations and performance.

http://wiki.xml3k.org/Akara/Dev/Amara%202.x%20Itertree%20plus#Exercises

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:19>

akar...@xml3k.org

Jan 13, 2011, 3:51:10 PM
to akar...@googlegroups.com
#48: Complete implementation of sendtree/itertree (former pushbind) and related
filter-based parser capabilities
----------------------------------------+-----------------------------------
Reporter: http://uche.myopenid.com/ | Owner: David Beazley <dave@…>
Type: defect | Status: closed
Priority: major | Milestone: safe harbor (i.e. 2.0 final)
Component: amara | Version: 2.0
Resolution: fixed | Keywords: parser, xml, expat, filter
----------------------------------------+-----------------------------------
Changes (by http://uche.myopenid.com/):

* status: new => closed
* resolution: => fixed


Comment:

This is complete, and has been available in trunk for a while. Not
everything brought up in this thread has been resolved. For further
discussion of modifications to the first iteration, open additional
tickets on Foundry: foundry.zepheira.com/projects/akara/

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:20>

akar...@xml3k.org

Jan 13, 2011, 5:04:33 PM
to akar...@googlegroups.com
#48: Complete implementation of sendtree/itertree (former pushbind) and related
filter-based parser capabilities
----------------------------------------+-----------------------------------
Reporter: http://uche.myopenid.com/ | Owner: David Beazley <dave@…>
Type: defect | Status: closed
Priority: major | Milestone: safe harbor (i.e. 2.0 final)
Component: amara | Version: 2.0
Resolution: fixed | Keywords: parser, xml, expat, filter
----------------------------------------+-----------------------------------

Comment(by http://uche.myopenid.com/):

That should have been:

http://foundry.zepheira.com/projects/akara/

I also want to point out that the tutorial has been updated to cover this:

http://wiki.xml3k.org/Amara/Tutorial#Incremental_parsing

And the test suite is at:

https://github.com/zepheira/amara/tree/master/test/pushtree

--
Ticket URL: <http://trac.xml3k.org/ticket/48#comment:21>
