Amara 3 basic/experimental XML parser available. Help wanted.


Uche Ogbuji

Jun 17, 2014, 11:33:37 AM
to ak...@googlegroups.com
Hi folks,

I've actually made more progress than I'd expected on Amara 3 (it makes a huge difference when I need it for work). We now have a rudimentary MicroXML parser implemented. Reminder: the only built-in parser for Amara 3 will be for MicroXML. Full/legacy XML support will be via a pyexpat utility to convert from XML to MicroXML.

I ended up going with a hand-crafted parser. Yes, I've been telling people not to write their own XML parser for over 15 years, but MicroXML is much simplified, so if you have a lot of experience it's not as daunting a task. James Clark wrote a hand-crafted JS parser, and that inspired me to do the same in Python. It's very fast and has some features that are very rare in XML parsers, such as the ability to feed in the XML input little by little (e.g. reading from a socket or stream and parsing incrementally). Also, it's designed around coroutines, so you can get SAX-like efficiency without the difficult callback architecture SAX requires. Overall, it's designed to be as Pythonic as possible.

I'm really excited about all this, and motivated to keep on going, but I could use some help, especially in testing, error handling and i18n. Please let me know if you have some cycles to volunteer.

The code is here. Python 3.4+ is required, as well as the base amara3-iri package (pip install "amara3-iri<=3.0.0a1"):


A quick example:

Python 3.4.1 (default, Jun 15 2014, 22:45:52) 
[GCC 4.2.1 Compatible Apple Clang 2.1 (tags/Apple/clang-163.7.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from amara3.uxml.parser import parsefrags
>>> events = parsefrags(['<hello', '><bold>world</bold></hello>'])
>>> for ev in events: print(ev)
... 
(<event.start_element: 1>, 'hello', {}, [])
(<event.start_element: 1>, 'bold', {}, ['hello'])
(<event.characters: 3>, 'world')
(<event.end_element: 2>, 'bold', ['hello'])
(<event.end_element: 2>, 'hello', [])
>>> 

Notice how start & end element events also include the stack of ancestor elements, a detail that takes a burden off the calling code.
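To see what that buys you, here's a stdlib-only sketch over the exact event tuples printed above (the enum below is a stand-in mirroring the session output, not amara3's actual definition): finding elements nested under a given ancestor needs no manual stack bookkeeping.

```python
from enum import Enum

# Stand-in for amara3.uxml.parser.event, mirroring the values
# shown in the session above (assumption: the real enum may differ).
class event(Enum):
    start_element = 1
    end_element = 2
    characters = 3

# The event stream from the example above, as plain tuples
events = [
    (event.start_element, 'hello', {}, []),
    (event.start_element, 'bold', {}, ['hello']),
    (event.characters, 'world'),
    (event.end_element, 'bold', ['hello']),
    (event.end_element, 'hello', []),
]

# The ancestor list (ev[3]) makes "which elements sit under <hello>?"
# a one-liner; with a traditional SAX API you'd maintain the stack yourself.
nested = [ev[1] for ev in events
          if ev[0] is event.start_element and 'hello' in ev[3]]
print(nested)  # ['bold']
```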


--
Uche Ogbuji                                       http://uche.ogbuji.net
Founding Partner, Zepheira                  http://zepheira.com
Author, _Ndewo, Colorado_                 http://uche.ogbuji.net/ndewo/
Founding editor, Kin Poetry Journal      http://wearekin.org
http://copia.ogbuji.net    http://www.linkedin.com/in/ucheogbuji    http://twitter.com/uogbuji

Sylvain Hellegouarch

Jun 17, 2014, 11:37:20 AM
to ak...@googlegroups.com
Hi Uche,

Great news.


On 17 June 2014 17:33, Uche Ogbuji <uc...@ogbuji.net> wrote:
Hi folks,

I've actually made more progress than I'd expected on Amara 3 (it makes a huge difference when I need it for work). We now have a rudimentary MicroXML parser implemented. Reminder: the only built-in parser for Amara 3 will be for MicroXML. Full/legacy XML support will be via a pyexpat utility to convert from XML to MicroXML.

I ended up going with a hand-crafted parser. Yes, I've been telling people not to write their own XML parser for over 15 years, but MicroXML is much simplified, so if you have a lot of experience it's not as daunting a task. James Clark wrote a hand-crafted JS parser, and that inspired me to do the same in Python. It's very fast and has some features that are very rare in XML parsers, such as the ability to feed in the XML input little by little (e.g. reading from a socket or stream and parsing incrementally).

This is brilliant. I had used such a mechanism in my old XML library [1] so that I could use it to parse XMPP streams [2]. I will happily try it with MicroXML some day :)

--
- Sylvain
http://www.defuze.org
http://twitter.com/lawouach

Uche Ogbuji

Jun 17, 2014, 11:51:27 AM
to ak...@googlegroups.com
On Tue, Jun 17, 2014 at 9:37 AM, Sylvain Hellegouarch <s...@defuze.org> wrote:
This is brilliant. I had used such a mechanism in my old XML library [1] so that I could use it to parse XMPP streams [2]. I will happily try it with MicroXML some day :)


Excellent! Yeah I think we've chatted about this, and I've wanted this capability for a long time, but it's very tricky to do with expat/C, so I'm really glad that I can now architect the new parser around it. It's another reason why I went with a hand-crafted parser. Most parser generators assume an architecture of consuming all the input at one go, and they also consume a lot of memory with the state stack, and I wanted to inline things in order to avoid this.

Luis Miguel Morillas

Jun 17, 2014, 11:56:46 AM
to ak...@googlegroups.com
Oh, cool,

I'll help with some cases for your test suite.


Saludos,

-- luismiguel (@lmorillas)
> --
> You received this message because you are subscribed to the Google Groups
> "akara" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to akara+un...@googlegroups.com.
> To post to this group, send email to ak...@googlegroups.com.
> Visit this group at http://groups.google.com/group/akara.
> For more options, visit https://groups.google.com/d/optout.

Uche Ogbuji

Jun 17, 2014, 12:05:58 PM
to ak...@googlegroups.com
On Tue, Jun 17, 2014 at 9:56 AM, Luis Miguel Morillas <mori...@gmail.com> wrote:
Oh, cool,

I'll help with some cases for your test suite.

Thanks! The test suite so far is here:


It uses py.test

Erroneous documents aren't very well tested, and it would be good to get many more patterns of well-formed docs tested as well.

Please feel free to fork and issue pull requests with any progress: the GitHub way.


Sylvain Hellegouarch

Jun 17, 2014, 4:15:57 PM
to ak...@googlegroups.com


On 17 June 2014 18:05, Uche Ogbuji <uc...@ogbuji.net> wrote:
On Tue, Jun 17, 2014 at 9:56 AM, Luis Miguel Morillas <mori...@gmail.com> wrote:
Oh, cool,

I'll help with some cases for your test suite.

Thanks! The test suite so far is here:


It uses py.test

Erroneous documents aren't very well tested, and it would be good to get many more patterns of well-formed docs tested as well.

Please feel free to fork and issue pull requests with any progress: the Github way.



Just cloned it and tried running the tests but got this:

>   from amara3.uxml.parser import parse, parser, event
../../.venvs/default34/lib/python3.4/site-packages/amara3/uxml/parser.py:13: in <module>
>   from amara3.util import coroutine
E   ImportError: No module named 'amara3.util'

Is there a missing package?

Uche Ogbuji

Jun 17, 2014, 4:55:27 PM
to ak...@googlegroups.com
Yep. You need to first do:

pip install "amara3-iri<=3.0.0a1"

(Or equivalent.) I mentioned it in my earlier message, but I'll confirm that it's in the README.md

Good luck.


Sylvain Hellegouarch

Jun 17, 2014, 4:58:26 PM
to ak...@googlegroups.com
My bad, sorry.



Sylvain Hellegouarch

Jun 17, 2014, 5:00:04 PM
to ak...@googlegroups.com
Well unfortunately, it didn't help.

pip install "amara3-iri<=3.0.0a1"
Downloading/unpacking amara3-iri<=3.0.0a1
  Downloading amara3-iri-3.0.0a1.tar.gz (40kB): 40kB downloaded
  Running setup.py (path:.venvs/default34/build/amara3-iri/setup.py) egg_info for package amara3-iri
    
Installing collected packages: amara3-iri
  Running setup.py install for amara3-iri
    
Successfully installed amara3-iri
Cleaning up...


py.test-3.4 test/
============================================================================== test session starts ==============================================================================
platform linux -- Python 3.4.0 -- py-1.4.20 -- pytest-2.5.2
collected 0 items / 1 errors 

==================================================================================== ERRORS =====================================================================================
___________________________________________________________________ ERROR collecting test/uxml/test_parser.py ___________________________________________________________________
test/uxml/test_parser.py:3: in <module>
>   from amara3.uxml.parser import parse, parser, event
.venvs/default34/lib/python3.4/site-packages/amara3/uxml/parser.py:13: in <module>
>   from amara3.util import coroutine
E   ImportError: No module named 'amara3.util'

Uche Ogbuji

Jun 17, 2014, 5:00:16 PM
to ak...@googlegroups.com
On Tue, Jun 17, 2014 at 2:58 PM, Sylvain Hellegouarch <s...@defuze.org> wrote:
My bad. sorry.

Not at all. You made me realize that it's not in the README.md, so I'm updating that now.

I'm also going to add a fun little example of use, since I realize all there is right now is the test suite.

--Uche

Uche Ogbuji

Jun 17, 2014, 5:05:04 PM
to ak...@googlegroups.com
Well, first of all, thanks so much for being the first guinea pig :)

Here's what I think happened. I must have installed from source. Source is here:


It's possible the coroutine stuff is not in the 3.0.0a1 release.

Can you try installing from source above and see if that fixes it? If so I'll quickly prepare a 3.0.0a2 release of amara3-iri.

--Uche




Uche Ogbuji

Jun 17, 2014, 5:30:30 PM
to ak...@googlegroups.com
Yeah, I confirmed the problem by examining:


I've uploaded amara3-iri 3.0.0a2 to PyPI. HTH.

--Uche

Uche Ogbuji

Jun 17, 2014, 10:17:47 PM
to ak...@googlegroups.com
OK more updates to the README, and added a simple example of use:


Above parses the project's README.md Markdown and prints out all the URLs linked.

Also, for fun, I whipped up a quick benchmark:

#Compute the length of the on-the-fly doc
$ python -c 'print(len("<a>" + "".join("<b attr=\"1&lt;2&gt;3\">4&lt;5&gt;6</b>"*10000) + "</a>"))'
370007

#Check parse timing
$ python -m timeit -s 'from amara3.uxml.parser import parse; doc="<a>" + "".join("<b attr=\"1&lt;2&gt;3\">4&lt;5&gt;6</b>"*10000) + "</a>"'  '(e for e in parse(doc))'
1000000 loops, best of 3: 1.24 usec per loop

So that's pretty encouraging to parse a 370K document in 1.24 usec (on my 2009 MacBookPro) :)

Sylvain Hellegouarch

Jun 18, 2014, 8:48:01 AM
to ak...@googlegroups.com
Hi Uche,

Sorry yesterday I couldn't respond. I will look at it tonight.



Sylvain Hellegouarch

Jun 18, 2014, 1:46:03 PM
to ak...@googlegroups.com
Tests are now passing :)

Sylvain Hellegouarch

Jun 18, 2014, 4:32:57 PM
to ak...@googlegroups.com
Hi Uche,

I didn't have a chance to write any tests (I must admit, I'm not familiar with the py.test syntax itself... I'm old school and still write unittests). Anyhow, I went through the parser code, and it's quite refreshing that it's so concise. I understand it's a work in progress, but it's really nice to have it all displayed in a single module.

A couple of questions:

* why do you return the event stack in your event?
* it seems to me that your parsing API (parsefrags, for instance) will still expect a complete fragment as input. I would rather have something that allows me to feed byte by byte, with events being yielded as soon as they arrive, rather than use an accumulator.

For instance:

from amara3.uxml.parser import parser

parser = parser()
event = parser.send('<hello id')
print(event)
event = parser.send('="12"')
print(event)
event = parser.send('>')
print(event)
event = parser.send('world')
print(event)
event = parser.send('</hello>')
print(event)

This is what I'd be the most interested in:

* XMPP stanzas are often sent out at once, but on a slow network the socket may read an incomplete set of stanzas. If two of them are received but only the first one is indeed complete, I still expect the parser to yield as soon as possible. This is what bridge does, thanks to expat, which supports partial feeds.
* By yielding as soon as possible, it means one could also stop the parsing more quickly.

Just some thoughts...

Uche Ogbuji

Jun 19, 2014, 1:24:21 AM
to ak...@googlegroups.com
Hi Sylvain,

I appreciate this discussion because this work is a complete reboot with MicroXML, Python 3 and a new ground-up architecture. Now is the time to pick holes in the API.


On Wed, Jun 18, 2014 at 2:32 PM, Sylvain Hellegouarch <s...@defuze.org> wrote:
Hi Uche,

I didn't have a chance to write any tests (I must admit, I'm not familiar with the py.test syntax itself... I'm old school and still write unittests). Anyhow, I went through the parser code, and it's quite refreshing that it's so concise. I understand it's a work in progress, but it's really nice to have it all displayed in a single module.

A couple of questions:

* why do you return the event stack in your event?

It doesn't return the event stack. It returns the element ancestor list for start and end element events. This eliminates one of the biggest sources of complexity in the code of callers of traditional XML APIs. Interestingly enough, John Cowan, author of the MicroLark API for MicroXML, had the same idea at around the same time.

 
* it seems to me that your parsing API (parsefrags for instance), will still expect a complete fragment as input.

No, it doesn't. A quick look at the test suite shows many examples that are nowhere near complete fragments.

 
I would rather have something that allows me to feed byte by byte, with events being yielded as soon as they arrive, rather than use an accumulator.

You can feed things byte by byte or in whatever chunks you desire. If you also want events incrementally that's no problem either, but you'll have to get used to non-traditional program flow (i.e. generators/coroutines).

 
For instance:

from amara3.uxml.parser import parser

parser = parser()
event = parser.send('<hello id')
print(event)
event = parser.send('="12"')
print(event)
event = parser.send('>')
print(event)
event = parser.send('world')
print(event)
event = parser.send('</hello>')
print(event)

I'm trying to piece together what you mean by the above.  It doesn't make sense to return a single event from your parser.send() invocation because there may be more than one event returned:

parser = parser()
event = parser.send('<hello id')
print(event)
event = parser.send('="12"')
print(event)
event = parser.send('>world</hello>')
print(event)

That last one should return 3 events.

You could have parser.send() return/yield a list of events:

parser = parser()
events = parser.send('<hello id')
print(events)
events = parser.send('="12"')
print(events)
events = parser.send('>world</hello>')
print(events)
 
Where events would be an empty list the first two times, and the last line would print all three events. However, implementing it the way it's written above would require one of two approaches:

1) turn parser into a class, which would eliminate the simplicity of its internals and would also eliminate a lot of its efficiency because there would be a lot of function calls and stack-based state management. The design of the parser now uses iteration rather than recursion to manage the state machine, which in Python is a huge win for CPU and memory performance.

2) Make parser a generator which is sent input *and* yields results. This is *very* *very* tricky to do in Python, and in fact, David Beazley in his "Curious Course on Coroutines and Concurrency" [1] recommends strongly against it, for very good reason. Incidentally dabeaz's document is superb, and I highly recommend it to anyone looking into Amara3's internals or trying to use it in advanced ways. Heck, I recommend it to every Python programmer. Mastering coroutines pays huge dividends in learning how to write simpler and optimized code.
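To make the pitfall concrete, here's a toy generator (purely illustrative, not amara3 code) that both receives input via send() and yields results. Note the mandatory priming call, and the way each yielded result naturally lags the input flow; in something as stateful as a markup parser this gets brain-bending fast:

```python
def echo_upper():
    """Toy two-way generator: receives text via send(), yields results."""
    result = None
    while True:
        text = yield result   # yield the previous result, receive the next input
        result = text.upper()

g = echo_upper()
next(g)                  # prime: run to the first yield (which returns None)
print(g.send('hello'))   # HELLO
print(g.send('world'))   # WORLD
```

Even in this trivial case you have to remember to prime the generator and to reason about which yield corresponds to which send; that's the tangle the one-way design avoids.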


You can definitely get the basic idea of what you're asking for. You just have to stick to one "side", and to coroutines in particular:

from amara3.util import coroutine
from amara3.uxml.parser import parser, event

@coroutine
def handler():
    while True:
        ev = yield
        print(ev)
    return

h = handler()
p = parser(h)

p.send(('<hello id', False))
p.send(('="12"', False))
p.send(('>', False))
p.send(('world', False))
p.send(('</hello>', True))

Output (tested & works):

(<event.start_element: 1>, 'hello', {'id': '12'}, [])
(<event.characters: 3>, 'world')
(<event.end_element: 2>, 'hello', [])

You can think about the above as splitting the code that sends fragments and the code that handles events into two separate coroutines. I think in most actual use-cases such separation of concerns is more realistic, anyway. You would probably not want to interleave protocol handling and application logic in such a linear fashion.

Note: amara3.uxml.parser.parse and amara3.uxml.parser.parsefrags are merely higher-level utility functions over the lower-level parser/handler coroutine interface, because not everyone will want to deal with coroutines. Your use-case is a fairly advanced one.


This is what I'd be the most interested in:

* XMPP stanzas are often sent out at once but on a slow network, the socket may read an incomplete set of stanzas. If two of them are received but only the first one is indeed complete, I still expect the parser to yield as soon as possible. This is what bridge does thanks to expat which supports partial feed.
* By yielding as soon as possible, it means one could also stop the parsing more quickly.

Yes, that's exactly what the code above demonstrates. The coroutine receives events as soon as they are ready.


Sylvain Hellegouarch

Jun 19, 2014, 5:19:06 AM
to ak...@googlegroups.com
Hi Uche,

Thanks for the feedback.


On 19 June 2014 07:24, Uche Ogbuji <uc...@ogbuji.net> wrote:
Hi Sylvain,

I appreciate this discussion because this work is a complete reboot with MicroXML, Python 3 and a new ground-up architecture. Now is the time to pick holes in the API.


On Wed, Jun 18, 2014 at 2:32 PM, Sylvain Hellegouarch <s...@defuze.org> wrote:
Hi Uche,

I didn't have a chance to write any tests (I must admit, I'm not familiar with the py.test syntax itself... I'm old school and still write unittests). Anyhow, I went through the parser code, and it's quite refreshing that it's so concise. I understand it's a work in progress, but it's really nice to have it all displayed in a single module.

A couple of questions:

* why do you return the event stack in your event?

It doesn't return the event stack. It returns the element ancestor list for start and end element events. This eliminates one of the biggest sources of complexity in the code of callers of traditional XML APIs. Interestingly enough, John Cowan, author of the MicroLark API for MicroXML, had the same idea at around the same time.

That's a good idea but I didn't get it from the code... maybe a comment or two to help a poor soul :=)

 

 
* it seems to me that your parsing API (parsefrags for instance), will still expect a complete fragment as input.

No, it doesn't. A quick look at the test suite shows many examples that are nowhere near complete fragments.


I'll be honest, I didn't understand most of the test suite. I assume it's really geared towards py.test. I didn't understand what was tested.


 

 
I would rather have something that allows me to feed byte by byte, with events being yielded as soon as they arrive, rather than use an accumulator.

You can feed things byte by byte or in whatever chunks you desire. If you also want events incrementally that's no problem either, but you'll have to get used to non-traditional program flow (i.e. generators/coroutines).

I'm fine with using generators but the code isn't making it clear how to. I can understand since it is at an early stage.
 

 
For instance:

from amara3.uxml.parser import parser

parser = parser()
event = parser.send('<hello id')
print(event)
event = parser.send('="12"')
print(event)
event = parser.send('>')
print(event)
event = parser.send('world')
print(event)
event = parser.send('</hello>')
print(event)

I'm trying to piece together what you mean by the above.  It doesn't make sense to return a single event from your parser.send() invocation because there may be more than one event returned:


I agree returning a single event makes little sense. My point being that I could get the events as soon as they were available. I don't want an accumulator. What for?

 

parser = parser()
event = parser.send('<hello id')
print(event)
event = parser.send('="12"')
print(event)
event = parser.send('>world</hello>')
print(event)

That last one should return 3 events.

You could have parser.send() return/yield a list of events:

parser = parser()
events = parser.send('<hello id')
print(events)
events = parser.send('="12"')
print(events)
events = parser.send('>world</hello>')
print(events)
 
Where events would be an empty list the first 2 times and the last line would print all 3 events. However, to implement it the way it's written above would require 1 of 2 approaches:


This is what I want indeed :)

 

1) turn parser into a class, which would eliminate the simplicity of its internals and would also eliminate a lot of its efficiency because there would be a lot of function calls and stack-based state management. The design of the parser now uses iteration rather than recursion to manage the state machine, which in Python is a huge win for CPU and memory performance.


I would discourage a class approach too.

 

2) Make parser a generator which is sent input *and* yields results. This is *very* *very* tricky to do in Python, and in fact, David Beazley in his "Curious Course on Coroutines and Concurrency" [1] recommends strongly against it, for very good reason. Incidentally dabeaz's document is superb, and I highly recommend it to anyone looking into Amara3's internals or trying to use it in advanced ways. Heck, I recommend it to every Python programmer. Mastering coroutines pays huge dividends in learning how to write simpler and optimized code.

Interestingly, I used that approach with ws4py (I think) and it worked just fine as far as I could see. I haven't read his document so I will.

The frame parser yields the number of bytes it needs to move on and is sent those bytes by the streamer.

I probably would have used "yield from" if I didn't care about Python 2.
 


You can definitely get the basic idea of what you're asking for. You just have to stick to one "side", and to coroutines in particular:

from amara3.util import coroutine
from amara3.uxml.parser import parser, event

@coroutine
def handler():
    while True:
        ev = yield
        print(ev)
    return

h = handler()
p = parser(h)

p.send(('<hello id', False))
p.send(('="12"', False))
p.send(('>', False))
p.send(('world', False))
p.send(('</hello>', True))

Output (tested & works):

(<event.start_element: 1>, 'hello', {'id': '12'}, [])
(<event.characters: 3>, 'world')
(<event.end_element: 2>, 'hello', [])

You can think about the above as splitting the code that sends fragments and the code that handles events into two separate coroutines. I think in most actual use-cases such separation of concerns is more realistic, anyway. You would probably not want to interleave protocol handling and application logic in such a linear fashion.


Mmmh, I'd be fine with this though it feels noisy due to the extra flag.

 

Note: amara3.uxml.parser.parse and amara3.uxml.parser.parsefrags are merely higher level utility function over the lower level parser/handler coroutines interface because not everyone will want to deal with coroutines. Your use-case is a fairly advanced one.


My bad then :/

Uche Ogbuji

Jun 19, 2014, 9:32:53 AM
to ak...@googlegroups.com
On Tue, Jun 17, 2014 at 8:17 PM, Uche Ogbuji <uc...@ogbuji.net> wrote:
Also, for fun, I whipped up a quick benchmark:

#Compute the length of the on-the-fly doc
$ python -c 'print(len("<a>" + "".join("<b attr=\"1&lt;2&gt;3\">4&lt;5&gt;6</b>"*10000) + "</a>"))'
370007

#Check parse timing
$ python -m timeit -s 'from amara3.uxml.parser import parse; doc="<a>" + "".join("<b attr=\"1&lt;2&gt;3\">4&lt;5&gt;6</b>"*10000) + "</a>"'  '(e for e in parse(doc))'
1000000 loops, best of 3: 1.24 usec per loop

So that's pretty encouraging to parse a 370K document in 1.24 usec (on my 2009 MacBookPro) :)

I must have been tired then, because I just revisited and that result is nowhere near credible.

(py3)Uche-Ogbujis-MacBook-Pro-2010:amara3-xml uche$ python -m timeit -s 'from amara3.uxml.parser import parse; doc="<a>" + "".join("<b attr=\"1&lt;2&gt;3\">4&lt;5&gt;6</b>"*10000) + "</a>"'  'for e in parse(doc): pass'
10 loops, best of 3: 2.6 sec per loop
(py3)Uche-Ogbujis-MacBook-Pro-2010:amara3-xml uche$ python -m timeit -s 'from amara3.uxml.parser import parse; doc="<a>" + "".join("<b attr=\"1&lt;2&gt;3\">4&lt;5&gt;6</b>"*100000) + "</a>"'  'for e in parse(doc): pass'
10 loops, best of 3: 25.7 sec per loop


Still encouraging are the linear scaling with an order-of-magnitude growth in document size, and the fact that in the latter run (a 3.7M doc or so) total resident size never seems to rise above 90M. Also, actual CPU load is relatively low, so I'll have to investigate where the wait states are. I won't prematurely optimize, but I'll make a mental note to do some profiling later on.
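In case the original 1.24 usec puzzles anyone: the first timeit statement built a generator expression around parse(doc) but never iterated it, so it timed only object creation, not parsing. A stdlib-only illustration of the same trap (fake_parse is a hypothetical stand-in for a generator-based parser, not amara3):

```python
def fake_parse(doc):
    # Hypothetical stand-in for a parser implemented as a generator:
    # no work happens until someone iterates it.
    for ch in doc:
        yield ch

doc = 'x' * 100000
g = (e for e in fake_parse(doc))   # near-instant: nothing parsed yet
events = list(g)                   # only here does the actual work run
print(len(events))                 # 100000
```

That's why the corrected benchmark drives the loop with `for e in parse(doc): pass` instead of wrapping it in a generator expression.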

Uche Ogbuji

Jun 19, 2014, 9:51:17 AM
to ak...@googlegroups.com
On Thu, Jun 19, 2014 at 3:19 AM, Sylvain Hellegouarch <s...@defuze.org> wrote:
It doesn't return the event stack. It returns the element ancestor list for start and end element events. This eliminates one of the biggest sources of complexity in the code of callers of traditional XML APIs. Interestingly enough, John Cowan, author of the MicroLark API for MicroXML, had the same idea at around the same time.

That's a good idea but I didn't get it from the code... maybe a comment or two to help a poor soul :=)

Yeah, good point. I'll work on that.


* it seems to me that your parsing API (parsefrags for instance), will still expect a complete fragment as input.

No, it doesn't. A quick look at the test suite shows many examples that are nowhere near complete fragments.


I'll be honest, I didn't understand most of the test suite. I assume it's really geared towards py.test. I didn't understand what was tested.

OK I guess more commenting is in order.

 
I would rather have something that allows me to feed byte by byte, with events being yielded as soon as they arrive, rather than use an accumulator.

You can feed things byte by byte or in whatever chunks you desire. If you also want events incrementally that's no problem either, but you'll have to get used to non-traditional program flow (i.e. generators/coroutines).

I'm fine with using generators but the code isn't making it clear how to. I can understand since it is at an early stage.

I'd think you'd want to get that info from docs rather than the code. Parsing is pretty complex stuff, especially when you're dealing with something like XML, so the code is going to be overwhelmingly concerned with that part of things. Did my example in this thread help? The idea is to add it to the examples dir, and also use it in docs.

 
For instance:

from amara3.uxml.parser import parser

parser = parser()
event = parser.send('<hello id')
print(event)
event = parser.send('="12"')
print(event)
event = parser.send('>')
print(event)
event = parser.send('world')
print(event)
event = parser.send('</hello>')
print(event)

I'm trying to piece together what you mean by the above.  It doesn't make sense to return a single event from your parser.send() invocation because there may be more than one event returned:


I agree returning a single event makes little sense. My point being that I could get the events as soon as they were available. I don't want an accumulator. What for?

You don't have to use an accumulator unless you want an old-school parse API. In fact the accumulator is not exposed except in the same part of the code where you can figure out how to replace the accumulator with an immediate yield from the coroutine :)

 
parser = parser()
event = parser.send('<hello id')
print(event)
event = parser.send('="12"')
print(event)
event = parser.send('>world</hello>')
print(event)

That last one should return 3 events.

You could have parser.send() return/yield a list of events:

parser = parser()
events = parser.send('<hello id')
print(events)
events = parser.send('="12"')
print(events)
events = parser.send('>world</hello>')
print(events)
 
Where events would be an empty list the first 2 times and the last line would print all 3 events. However, to implement it the way it's written above would require 1 of 2 approaches:


This is what I want indeed :)

 

1) turn parser into a class, which would eliminate the simplicity of its internals and would also eliminate a lot of its efficiency because there would be a lot of function calls and stack-based state management. The design of the parser now uses iteration rather than recursion to manage the state machine, which in Python is a huge win for CPU and memory performance.


I would discourage a class approach too.

 

2) Make parser a generator which is sent input *and* yields results. This is *very* *very* tricky to do in Python, and in fact, David Beazley in his "Curious Course on Coroutines and Concurrency" [1] recommends strongly against it, for very good reason. Incidentally dabeaz's document is superb, and I highly recommend it to anyone looking into Amara3's internals or trying to use it in advanced ways. Heck, I recommend it to every Python programmer. Mastering coroutines pays huge dividends in learning how to write simpler and optimized code.

Interestingly, I used that approach with ws4py (I think) and it worked just fine as far as I could see. I haven't read his document so I will.

I highly recommend it. I don't know ws4py, but even before I read David's document (I've been experimenting with coroutines since the Christian Tismer Stackless Python days), it became clear that the generator.send() approach Python adopted for, in effect, coroutines would cause a real tangle, and brain-bending logic for even simple cases, never mind something as complex as parsing markup.

 
The frame parse yields the number of bytes it needs to move on and is sent those bytes by the streamer.

I probably would have used "yield from" if I didn't care about Python 2.

I admit my generator/coroutine lore is largely still in the Python 2 space, so maybe there is something new I can try. In browsing your code, it doesn't really seem to be the same circumstances, but I can investigate further.

  

You can definitely get the basic idea of what you're asking for. You just have to stick to one "side", and to coroutines in particular:

from amara3.util import coroutine
from amara3.uxml.parser import parser, event

@coroutine
def handler():
    while True:
        ev = yield
        print(ev)
    return

h = handler()
p = parser(h)

p.send(('<hello id', False))
p.send(('="12"', False))
p.send(('>', False))
p.send(('world', False))
p.send(('</hello>', True))

Output (tested & works):

(<event.start_element: 1>, 'hello', {'id': '12'}, [])
(<event.characters: 3>, 'world')
(<event.end_element: 2>, 'hello', [])

You can think about the above as splitting the code that sends fragments and the code that handles events into two separate coroutines. I think in most actual use-cases such separation of concerns is more realistic, anyway. You would probably not want to interleave protocol handling and application logic in such a linear fashion.
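For completeness: `amara3.util.coroutine` in the example above is presumably the usual priming decorator (an assumption on my part; check the source). A minimal self-contained sketch of that pattern, with a stand-in `collector` in place of the handler:

```python
import functools

def coroutine(func):
    """Prime a generator-based coroutine: advance it to the first
    yield so it is immediately ready to receive values via send()."""
    @functools.wraps(func)
    def primed(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)  # run up to the first yield
        return gen
    return primed

@coroutine
def collector(out):
    """Stand-in for handler() above: just record whatever is sent in."""
    while True:
        out.append((yield))

received = []
c = collector(received)
c.send('event-1')   # no priming next() needed; the decorator did it
c.send('event-2')
print(received)     # ['event-1', 'event-2']
```

Without the decorator, every caller would have to remember the priming next() before the first send(), which is exactly the kind of boilerplate worth hiding.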


Mmmh, I'd be fine with this though it feels noisy due to the extra flag.

Without the extra flag you'd need a separate call to ensure the document is completed, or of course just p.close(). I have thought of eliminating the "done" flag externally and relying on close() and I might still do that.
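Relying on close() would mean handling GeneratorExit inside the parser coroutine. A toy sketch (my own illustration, not amara3's code) of a consumer that finalizes when closed instead of watching a done flag:

```python
def accumulator(results):
    """Collect fragments; on close(), treat the input as complete and
    emit the assembled document (a stand-in for end-of-parse work)."""
    buf = []
    try:
        while True:
            buf.append((yield))
    except GeneratorExit:
        # close() raises GeneratorExit at the yield; finalize here
        results.append(''.join(buf))

out = []
a = accumulator(out)
next(a)  # prime
a.send('<hello>')
a.send('world</hello>')
a.close()  # signals end of input instead of a ('', True) flag
print(out)  # ['<hello>world</hello>']
```

One constraint of this style: a generator may not yield again after GeneratorExit, so any finalization has to happen via side effects or the function's return.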


Note: amara3.uxml.parser.parse and amara3.uxml.parser.parsefrags are merely higher-level utility functions over the lower-level parser/handler coroutine interface, because not everyone will want to deal with coroutines. Your use-case is a fairly advanced one.


My bad then :/

Hey, good discussion.


Sylvain Hellegouarch

unread,
Jun 19, 2014, 9:58:12 AM6/19/14
to ak...@googlegroups.com



Hey, good discussion.



I will investigate your example further. The thread already helped understanding where you were coming from. I'll also read the talk on coroutine.
I think the send method on generators is powerful but, obviously, I must be missing something.
As for terminating the parsing, perhaps close or throwing a specific exception?

p.throw(ParserDone).

With that said, I'm a little concerned that I, as a user, would have to know when the parsing is complete. Being XML, I assume it can determine by itself when it has closed the whole fragment, can't it? 

Uche Ogbuji

unread,
Jun 19, 2014, 10:07:13 AM6/19/14
to ak...@googlegroups.com
On Thu, Jun 19, 2014 at 7:58 AM, Sylvain Hellegouarch <s...@defuze.org> wrote:



Hey, good discussion.



I will investigate your example further. The thread already helped understanding where you were coming from. I'll also read the talk on coroutine.
I think the send method on generators is powerful but, obviously, I must be missing something.
As for terminating the parsing, perhaps close or throwing a specific exception?

p.throw(ParserDone).

I did suggest close() in my response. I don't like the idea of a magic exception.

 
With that said, I'm a little concerned that I, as a user, would have to know when the parsing is complete. Being XML, I assume it can determine by itself when it has closed the whole fragment, can't it? 

It's possible from a code perspective to say "once the document element is closed, terminate the parse", but if round-tripping is important to the user, you would force them to carefully construct the end of the parse without the information needed to do so. In addition, MicroXML is designed for non-strict modes and error recovery; in particular, amara3 will support cases such as multiple document elements (in non-strict mode) and useful error reporting (in strict mode), so it can't just unilaterally decide when the input is complete.

Sylvain Hellegouarch

unread,
Jun 19, 2014, 10:08:51 AM6/19/14
to ak...@googlegroups.com
On 19 June 2014 16:07, Uche Ogbuji <uc...@ogbuji.net> wrote:
On Thu, Jun 19, 2014 at 7:58 AM, Sylvain Hellegouarch <s...@defuze.org> wrote:



Hey, good discussion.



I will investigate your example further. The thread already helped understanding where you were coming from. I'll also read the talk on coroutine.
I think the send method on generators is powerful but, obviously, I must be missing something.
As for terminating the parsing, perhaps close or throwing a specific exception?

p.throw(ParserDone).

I did suggest close() in my response. I don't like the idea of a magic exception.


Sorry, I was paraphrasing indeed :)
 

 
With that said, I'm a little concerned that I, as a user, would have to know when the parsing is complete. Being XML, I assume it can determine by itself when it has closed the whole fragment, can't it? 

It's possible from a code perspective to say "once the document element is closed, terminate the parse", but if round-tripping is important to the user, you would force them to carefully construct the end of the parse without the information needed to do so. In addition, MicroXML is designed for non-strict modes and error recovery; in particular, amara3 will support cases such as multiple document elements (in non-strict mode) and useful error reporting (in strict mode), so it can't just unilaterally decide when the input is complete.


That sounds like a good argument for explicit over implicit indeed. 

Uche Ogbuji

unread,
Jun 19, 2014, 10:11:23 AM6/19/14
to ak...@googlegroups.com
On Thu, Jun 19, 2014 at 8:07 AM, Uche Ogbuji <uc...@ogbuji.net> wrote:
On Thu, Jun 19, 2014 at 7:58 AM, Sylvain Hellegouarch <s...@defuze.org> wrote:



Hey, good discussion.



I will investigate your example further. The thread already helped understanding where you were coming from. I'll also read the talk on coroutine.
I think the send method on generators is powerful but, obviously, I must be missing something.
As for terminating the parsing, perhaps close or throwing a specific exception?

p.throw(ParserDone).

I did suggest close() in my response. I don't like the idea of a magic exception.

 
With that said, I'm a little concerned that I, as a user, would have to know when the parsing is complete. Being XML, I assume it can determine by itself when it has closed the whole fragment, can't it? 

It's possible from a code perspective to say "once the document element is closed, terminate the parse", but if round-tripping is important to the user, you would force them to carefully construct the end of the parse without the information needed to do so. In addition, MicroXML is designed for non-strict modes and error recovery; in particular, amara3 will support cases such as multiple document elements (in non-strict mode) and useful error reporting (in strict mode), so it can't just unilaterally decide when the input is complete.


I forgot to mention that of course if you wanted simple, single-document-element parsing, Amara will tell you when the parse is "finished". That would be the case of an end_element event with an empty ancestor stack. You could easily check for that condition and then do:

p.send(('', True))

or, if we do indeed implement GeneratorExit (which could be tricky):

p.close()

That said I think I should define a constant

PARSE_DONE = ('', True)
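A handler can spot that end-of-document condition with a check like the following standalone sketch. The enum and tuple shapes mirror the event output shown earlier in the thread (an assumption; this is not amara3's actual code and does not import it):

```python
from enum import Enum

class event(Enum):
    """Mirrors the event kinds seen in the thread's parser output."""
    start_element = 1
    end_element = 2
    characters = 3

PARSE_DONE = ('', True)  # the proposed end-of-input sentinel fragment

def document_complete(ev):
    """True when an end_element event closes the document element,
    i.e. its ancestor stack (last tuple member) is empty."""
    return ev[0] == event.end_element and ev[-1] == []

evs = [
    (event.start_element, 'hello', {}, []),
    (event.characters, 'world'),
    (event.end_element, 'hello', []),
]
print([document_complete(e) for e in evs])  # [False, False, True]
```

On seeing that condition, the driving code would then send PARSE_DONE to wrap up the parse.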

Uche Ogbuji

unread,
Jun 19, 2014, 10:24:13 AM6/19/14
to ak...@googlegroups.com
On Thu, Jun 19, 2014 at 7:51 AM, Uche Ogbuji <uc...@ogbuji.net> wrote:
I probably would have used "yield from" if I didn't care about Python 2.

I admit my generator/coroutine lore is largely still in the Python 2 space, so maybe there is something new I can try. In browsing your code, it doesn't really seem to be the same circumstances, but I can investigate further.

Actually, I just browsed PEP 380 and I think that probably changes everything. I'm also seeing a few other Python 3 tricks I'm missing, so I'm going to spend time through the weekend absorbing, learning and experimenting, and then I can see a complete refactoring of amara3 in the direction you're suggesting.
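The key gain from PEP 380 is that send() passes straight through a `yield from` to the sub-generator, and the sub-generator's return value comes back without any manual StopIteration plumbing. A minimal illustration (my own, not amara3 code):

```python
def read_pair():
    """Sub-coroutine: receive two items, return them as a pair."""
    a = yield
    b = yield
    return (a, b)

def read_pairs(n):
    """Delegate to read_pair() n times; each pair's return value
    flows back here directly thanks to yield from (PEP 380)."""
    pairs = []
    for _ in range(n):
        pairs.append((yield from read_pair()))
    return pairs

g = read_pairs(2)
next(g)  # prime
result = None
try:
    for item in ['a', 'b', 'c', 'd']:
        g.send(item)  # passes through to whichever read_pair is active
except StopIteration as done:
    result = done.value

print(result)  # [('a', 'b'), ('c', 'd')]
```

In Python 2 the delegating generator would have had to loop, forward sends, and catch StopIteration by hand, which is exactly the tangle being discussed.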

I had been meaning to do that, but you definitely pushed me over the line, so thanks :)
 

Werner

unread,
Jun 24, 2014, 10:54:35 AM6/24/14
to ak...@googlegroups.com
Hi Uche,

I follow with interest the discussion on Amara 3, you seem to make very good progress.

On 6/19/2014 15:51, Uche Ogbuji wrote:
...

I'm fine with using generators but the code isn't making it clear how to. I can understand since it is at an early stage.

I'd think you'd want to get that info from docs rather than the code. Parsing is pretty complex stuff, especially when you're dealing with something like XML, so the code is going to be overwhelmingly concerned with that part of things. Did my example in this thread help? The idea is to add it to the examples dir, and also use it in docs.

If you adapt/use Sphinx to document you could have it both ways and just do it once.

E.g. in the module doc of 'parser.py' have this:

start quote:

Following is a simple example of how to parse a document::

    here you have the code sample

Note the double colon.

More info can be found here:
sphinx-doc.org/rest.html

Werner