Hi folks,

I've actually made more progress than I expected on Amara 3 (it makes a huge difference when I need it for work). We now have a rudimentary MicroXML parser implemented. Reminder: the only built-in parser for Amara 3 will be for MicroXML. Full/legacy XML support will be via a pyexpat utility to convert from XML to MicroXML.
I ended up going with a hand-crafted parser. Yes, I've been telling people not to write their own XML parser for over 15 years, but MicroXML is so much simpler that, if you have a lot of experience, it's not as daunting a task. James Clark wrote a hand-crafted JS parser and that inspired me to do the same in Python. It's very fast and has some features that are very rare in XML parsers, such as the ability to feed in the XML input little by little (e.g. reading from a socket or stream and parsing incrementally).
This is brilliant. I had used such a mechanism in my old XML library [1] so that I could use it to parse XMPP streams [2]. I will happily try it with MicroXML some day :)
Oh, cool,
I'll help with some cases for your test suite.
On Tue, Jun 17, 2014 at 9:56 AM, Luis Miguel Morillas <mori...@gmail.com> wrote:
> Oh, cool,
> I'll help with some cases for your test suite.

Thanks! The test suite so far is here:

It uses py.test. Erroneous documents aren't very well tested, and it would be good to get many more patterns of well-formed docs tested as well.

Please feel free to fork and issue pull requests with any progress: the GitHub way.
--
You received this message because you are subscribed to the Google Groups "akara" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akara+un...@googlegroups.com.
To post to this group, send email to ak...@googlegroups.com.
Visit this group at http://groups.google.com/group/akara.
For more options, visit https://groups.google.com/d/optout.
My bad. sorry.
Hi Uche,

I didn't have a chance to write any tests (I must admit, I'm not familiar with the py.test syntax itself... I'm old-school and still write unittests). Anyhow, I went through the parser code and it's quite refreshing that it's so concise. I understand it's a work in progress but it's really nice to have it all displayed in a single module.

A couple of questions:

* why do you return the event stack in your event?
* it seems to me that your parsing API (parsefrags for instance) will still expect a complete fragment as input.

I would rather have something that allows me to feed byte per byte and events being yielded as soon as they arrive, rather than use an accumulator.

For instance:

    from amara3.uxml.parser import parser
    parser = parser()
    event = parser.send('<hello id')
    print(event)
    event = parser.send('="12"')
    print(event)
    event = parser.send('>')
    print(event)
    event = parser.send('world')
    print(event)
    event = parser.send('</hello>')
    print(event)

This is what I'd be the most interested in:

* XMPP stanzas are often sent out at once but on a slow network, the socket may read an incomplete set of stanzas. If two of them are received but only the first one is indeed complete, I still expect the parser to yield as soon as possible. This is what bridge does thanks to expat, which supports partial feed.
* By yielding as soon as possible, it means one could also stop the parsing more quickly.
Hi Sylvain,

I appreciate this discussion because this work is a complete reboot with MicroXML, Python 3 and a new ground-up architecture. Now is the time to pick holes in the API.

On Wed, Jun 18, 2014 at 2:32 PM, Sylvain Hellegouarch <s...@defuze.org> wrote:

> Hi Uche,
>
> I didn't have a chance to write any tests (I must admit, I'm not familiar with the py.test syntax itself... I'm old-school and still write unittests). Anyhow, I went through the parser code and it's quite refreshing that it's so concise. I understand it's a work in progress but it's really nice to have it all displayed in a single module.
>
> A couple of questions:
>
> * why do you return the event stack in your event?

It doesn't return the event stack. It returns the element ancestors list for start and end element events. This eliminates one of the biggest sources of complexity in the code of callers of traditional XML APIs. Interestingly enough, John Cowan, author of the MicroLark API for MicroXML, had the same idea at around the same time.

> * it seems to me that your parsing API (parsefrags for instance) will still expect a complete fragment as input.

No, it doesn't. A quick look at the test suite shows many examples that are nowhere near complete fragments.

> I would rather have something that allows me to feed byte per byte and events being yielded as soon as they arrive, rather than use an accumulator.

You can feed things byte by byte or in whatever chunks you desire. If you also want events incrementally that's no problem either, but you'll have to get used to non-traditional program flow (i.e. generators/coroutines).

> For instance:
>
>     from amara3.uxml.parser import parser
>     parser = parser()
>     event = parser.send('<hello id')
>     print(event)
>     event = parser.send('="12"')
>     print(event)
>     event = parser.send('>')
>     print(event)
>     event = parser.send('world')
>     print(event)
>     event = parser.send('</hello>')
>     print(event)

I'm trying to piece together what you mean by the above. It doesn't make sense to return a single event from your parser.send() invocation because there may be more than one event returned:

    parser = parser()
    event = parser.send('<hello id')
    print(event)
    event = parser.send('="12"')
    print(event)
    event = parser.send('>world</hello>')
    print(event)

That last one should return 3 events.

You could have parser.send() return/yield a list of events:

    parser = parser()
    events = parser.send('<hello id')
    print(events)
    events = parser.send('="12"')
    print(events)
    events = parser.send('>world</hello>')
    print(events)

Where events would be an empty list the first 2 times and the last line would print all 3 events. However, to implement it the way it's written above would require 1 of 2 approaches:
1) turn parser into a class, which would eliminate the simplicity of its internals and would also eliminate a lot of its efficiency because there would be a lot of function calls and stack-based state management. The design of the parser now uses iteration rather than recursion to manage the state machine, which in Python is a huge win for CPU and memory performance.
2) Make parser a generator which is sent input *and* yields results. This is *very* *very* tricky to do in Python, and in fact, David Beazley in his "Curious Course on Coroutines and Concurrency" [1] recommends strongly against it, for very good reason. Incidentally dabeaz's document is superb, and I highly recommend it to anyone looking into Amara3's internals or trying to use it in advanced ways. Heck, I recommend it to every Python programmer. Mastering coroutines pays huge dividends in learning how to write simpler and optimized code.
You can definitely get the basic idea of what you're asking for. You just have to stick to one "side", and to coroutines in particular:

    from amara3.util import coroutine
    from amara3.uxml.parser import parser, event

    @coroutine
    def handler():
        while True:
            ev = yield
            print(ev)
        return

    h = handler()
    p = parser(h)
    p.send(('<hello id', False))
    p.send(('="12"', False))
    p.send(('>', False))
    p.send(('world', False))
    p.send(('</hello>', True))

Output (tested & works):

    (<event.start_element: 1>, 'hello', {'id': '12'}, [])
    (<event.characters: 3>, 'world')
    (<event.end_element: 2>, 'hello', [])

You can think about the above as splitting the code that sends fragments and the code that handles events into two separate coroutines. I think in most actual use-cases such separation of concerns is more realistic, anyway. You would probably not want to interleave protocol handling and application logic in such a linear fashion.
Note: amara3.uxml.parser.parse and amara3.uxml.parser.parsefrags are merely higher-level utility functions over the lower-level parser/handler coroutine interface, because not everyone will want to deal with coroutines. Your use-case is a fairly advanced one.
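For readers new to this push style, here is a minimal, self-contained sketch of the arrangement described above. None of it is amara3 code: `coroutine` is only a guess at what `amara3.util.coroutine` does (priming the generator), and `line_splitter` is a toy stand-in for the real parser that splits on newlines instead of parsing MicroXML. The point is just that input may arrive in arbitrary chunks while the handler receives each item as soon as it is complete.

```python
import functools


def coroutine(func):
    """Prime a generator-based coroutine so it is ready for send().

    Assumption: this mirrors the usual decorator from Beazley's course;
    it is not copied from amara3.util.
    """
    @functools.wraps(func)
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)  # advance to the first yield
        return gen
    return start


@coroutine
def line_splitter(target):
    """Toy 'parser': buffers chunks and pushes each complete line to target."""
    buf = ''
    while True:
        chunk = yield
        buf += chunk
        while '\n' in buf:
            line, buf = buf.split('\n', 1)
            target.send(line)  # push downstream as soon as it is complete


@coroutine
def collector(out):
    """Toy handler: appends every item it is sent to the list out."""
    while True:
        out.append((yield))


received = []
p = line_splitter(collector(received))
for chunk in ('hel', 'lo\nwor', 'ld\n', 'partial'):
    p.send(chunk)
print(received)
```

The trailing `'partial'` chunk stays in the buffer, which is the same reason the real interface needs some way to be told that the input has ended.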
Also, for fun, I whipped up a quick benchmark:

    # Compute the length of the on-the-fly doc
    $ python -c 'print(len("<a>" + "".join("<b attr=\"1&lt;2&gt;3\">4&lt;5&gt;6</b>"*10000) + "</a>"))'
    370007

    # Check parse timing
    $ python -m timeit -s 'from amara3.uxml.parser import parse; doc="<a>" + "".join("<b attr=\"1&lt;2&gt;3\">4&lt;5&gt;6</b>"*10000) + "</a>"' '(e for e in parse(doc))'
    1000000 loops, best of 3: 1.24 usec per loop

So that's pretty encouraging to parse a 370K document in 1.24 usec (on my 2009 MacBookPro) :)
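One thing worth checking about the timing above: a generator expression is lazy, so if `parse` is itself a generator function (an assumption; the thread does not show its definition), the timed statement may only be measuring the cost of creating generator objects. The sketch below uses a hypothetical `fake_parse` stand-in, not amara3, to show the difference between building the expression and consuming it.

```python
import timeit

calls = []


def fake_parse(doc):
    """Hypothetical stand-in for a generator-based parse(); not amara3 code."""
    for ch in doc:
        calls.append(ch)  # record that real work happened
        yield ch


doc = "x" * 20000

# Building a generator expression does not run the generator body.
lazy = (e for e in fake_parse(doc))
work_before = len(calls)

# Consuming it is what triggers the work.
consumed = sum(1 for _ in lazy)
work_after = len(calls)

# Same comparison through timeit: creation only versus full consumption.
t_lazy = timeit.timeit(lambda: (e for e in fake_parse(doc)), number=5)
t_eager = timeit.timeit(lambda: list(fake_parse(doc)), number=5)
print(work_before, work_after, t_lazy < t_eager)
```

If that assumption holds, timing `list(parse(doc))` instead would force the whole document through the parser.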
> It doesn't return the event stack. It returns the element ancestors list for start and end element events. This eliminates one of the biggest sources of complexity in the code of callers of traditional XML APIs. Interestingly enough, John Cowan, author of the MicroLark API for MicroXML, had the same idea at around the same time.

That's a good idea but I didn't get it from the code... maybe a comment or two to help a poor soul :=)
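To make the ancestors idea concrete, here is a small sketch of a consumer that builds element paths without keeping its own stack. The event tuples follow the shape of the output shown earlier in the thread; the `event` enum is re-declared locally so the block runs on its own, and treating the ancestors list as a list of element names is an assumption (the thread only ever shows it empty).

```python
from enum import Enum


class event(Enum):
    """Local stand-in matching the reprs shown in the thread; not amara3's."""
    start_element = 1
    end_element = 2
    characters = 3


def element_paths(events):
    """Yield a slash-separated path for each start_element event.

    Assumption: the last item of a start_element tuple is a list of
    ancestor element names, outermost first.
    """
    for ev in events:
        if ev[0] is event.start_element:
            _, name, attrs, ancestors = ev
            yield '/'.join(list(ancestors) + [name])


# Hand-written sample events for a document like <doc><item id="12">world</item></doc>
sample = [
    (event.start_element, 'doc', {}, []),
    (event.start_element, 'item', {'id': '12'}, ['doc']),
    (event.characters, 'world'),
    (event.end_element, 'item', ['doc']),
    (event.end_element, 'doc', []),
]
paths = list(element_paths(sample))
print(paths)
```

With a traditional event API the caller would have to push on start and pop on end to know where it is; here each event carries that context.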
>> * it seems to me that your parsing API (parsefrags for instance) will still expect a complete fragment as input.
>
> No, it doesn't. A quick look at the test suite shows many examples that are nowhere near complete fragments.

I'll be honest, I didn't understand most of the test suite. I assume it's really geared towards py.test. I didn't understand what was tested.

>> I would rather have something that allows me to feed byte per byte and events being yielded as soon as they arrive, rather than use an accumulator.
>
> You can feed things byte by byte or in whatever chunks you desire. If you also want events incrementally that's no problem either, but you'll have to get used to non-traditional program flow (i.e. generators/coroutines).

I'm fine with using generators but the code isn't making it clear how to. I can understand since it is at an early stage.

>> For instance:
>>
>>     from amara3.uxml.parser import parser
>>     parser = parser()
>>     event = parser.send('<hello id')
>>     print(event)
>>     event = parser.send('="12"')
>>     print(event)
>>     event = parser.send('>')
>>     print(event)
>>     event = parser.send('world')
>>     print(event)
>>     event = parser.send('</hello>')
>>     print(event)
>
> I'm trying to piece together what you mean by the above. It doesn't make sense to return a single event from your parser.send() invocation because there may be more than one event returned:

I agree returning a single event makes little sense. My point was that I could get the events as soon as they were available. I don't want an accumulator. What for?

>     parser = parser()
>     event = parser.send('<hello id')
>     print(event)
>     event = parser.send('="12"')
>     print(event)
>     event = parser.send('>world</hello>')
>     print(event)
>
> That last one should return 3 events.
>
> You could have parser.send() return/yield a list of events:
>
>     parser = parser()
>     events = parser.send('<hello id')
>     print(events)
>     events = parser.send('="12"')
>     print(events)
>     events = parser.send('>world</hello>')
>     print(events)
>
> Where events would be an empty list the first 2 times and the last line would print all 3 events. However, to implement it the way it's written above would require 1 of 2 approaches:

This is what I want indeed :)

> 1) turn parser into a class, which would eliminate the simplicity of its internals and would also eliminate a lot of its efficiency because there would be a lot of function calls and stack-based state management. The design of the parser now uses iteration rather than recursion to manage the state machine, which in Python is a huge win for CPU and memory performance.

I would discourage a class approach too.

> 2) Make parser a generator which is sent input *and* yields results. This is *very* *very* tricky to do in Python, and in fact, David Beazley in his "Curious Course on Coroutines and Concurrency" [1] recommends strongly against it, for very good reason. Incidentally dabeaz's document is superb, and I highly recommend it to anyone looking into Amara3's internals or trying to use it in advanced ways. Heck, I recommend it to every Python programmer. Mastering coroutines pays huge dividends in learning how to write simpler and optimized code.

Interestingly, I used that approach with ws4py (I think) and it worked just fine as far as I could see. I haven't read his document so I will.
The frame parser yields the number of bytes it needs to move on and is sent those bytes by the streamer.
I probably would have used "yield from" if I didn't care about Python 2.
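A rough sketch of the pattern described here, where a generator yields how many bytes it needs next and a streamer sends it exactly that many. This is not ws4py's code: the frame format (one length byte followed by the payload) and the names `frame_parser` and `stream` are made up for illustration. It is also an instance of the "sent input and yields results" style discussed above, which works here because every `send()` is answered by exactly one `yield`.

```python
def frame_parser(frames):
    """Yield the byte count needed next; receive exactly that many bytes.

    Hypothetical frame format: 1 length byte, then that many payload bytes.
    Completed payloads are appended to the list frames.
    """
    while True:
        header = yield 1                       # ask for the length byte
        length = header[0]
        payload = (yield length) if length else b''
        frames.append(payload)


def stream(chunks):
    """Feed arbitrary chunks to frame_parser, slicing them to its requests."""
    frames = []
    parser = frame_parser(frames)
    needed = next(parser)                      # prime: first request
    buf = b''
    for chunk in chunks:
        buf += chunk
        while len(buf) >= needed:
            piece, buf = buf[:needed], buf[needed:]
            needed = parser.send(piece)        # parser answers with next request
    return frames


# Two complete frames split across chunk boundaries, plus an incomplete third.
frames = stream([b'\x05he', b'llo\x02', b'o', b'k\x03ab'])
print(frames)
```

The third frame announces three bytes but only two arrive, so it is never emitted; complete frames are delivered as soon as their last byte shows up.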
> You can definitely get the basic idea of what you're asking for. You just have to stick to one "side", and to coroutines in particular:
>
>     from amara3.util import coroutine
>     from amara3.uxml.parser import parser, event
>
>     @coroutine
>     def handler():
>         while True:
>             ev = yield
>             print(ev)
>         return
>
>     h = handler()
>     p = parser(h)
>     p.send(('<hello id', False))
>     p.send(('="12"', False))
>     p.send(('>', False))
>     p.send(('world', False))
>     p.send(('</hello>', True))
>
> Output (tested & works):
>
>     (<event.start_element: 1>, 'hello', {'id': '12'}, [])
>     (<event.characters: 3>, 'world')
>     (<event.end_element: 2>, 'hello', [])
>
> You can think about the above as splitting the code that sends fragments and the code that handles events into two separate coroutines. I think in most actual use-cases such separation of concerns is more realistic, anyway. You would probably not want to interleave protocol handling and application logic in such a linear fashion.

Mmmh, I'd be fine with this though it feels noisy due to the extra flag.

> Note: amara3.uxml.parser.parse and amara3.uxml.parser.parsefrags are merely higher level utility function over the lower level parser/handler coroutines interface because not everyone will want to deal with coroutines. Your use-case is a fairly advanced one.

My bad then :/
Hey, good discussion.
> Hey, good discussion.

I will investigate your example further. The thread already helped me understand where you were coming from. I'll also read the talk on coroutines.

I think the send method on generators is powerful but, obviously, I must be missing something.

As for terminating the parsing, perhaps close or throwing a specific exception?

    p.throw(ParserDone)
With that said, I'm a little concerned that I, as a user, would have to know when the parsing is complete. Being XML, I assume it can determine by itself when it has closed the whole fragment, can't it?
On Thu, Jun 19, 2014 at 7:58 AM, Sylvain Hellegouarch <s...@defuze.org> wrote:
>> Hey, good discussion.
>
> I will investigate your example further. The thread already helped me understand where you were coming from. I'll also read the talk on coroutines.
>
> I think the send method on generators is powerful but, obviously, I must be missing something.
>
> As for terminating the parsing, perhaps close or throwing a specific exception?
>
>     p.throw(ParserDone)

I did suggest close() in my response. I don't like the idea of a magic exception.

> With that said, I'm a little concerned that I, as a user, would have to know when the parsing is complete. Being XML, I assume it can determine by itself when it has closed the whole fragment, can't it?
It's possible from a code perspective to say "once the document element is closed terminate the parse" but if round-tripping is important to the user you would force them to carefully construct the end of the parse without the needed information to do so. In addition, MicroXML is designed for non-strict modes and error recovery and in particular amara3 will support cases such as multiple document elements (in non-strict mode) and useful error reporting (in strict mode), so it can't just unilaterally decide when the input is complete.
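As a sketch of how `close()` can serve as the explicit end-of-input signal on a generator-based coroutine: Python raises `GeneratorExit` at the suspended `yield` when `close()` is called, so a `try`/`finally` around the receive loop gives the coroutine a place to finish up. The `text_collector` below is a hypothetical event sink, not amara3 code, and whether amara3's parser handles `close()` this way is not shown in the thread.

```python
def text_collector(results):
    """Hypothetical sink: buffers text pieces and flushes once when closed."""
    parts = []
    try:
        while True:
            parts.append((yield))
    finally:
        # Reached when the caller invokes close(): GeneratorExit is raised
        # at the yield above and the finally clause runs on the way out.
        results.append(''.join(parts))


results = []
sink = text_collector(results)
next(sink)                      # prime the coroutine
sink.send('wor')
sink.send('ld')
flushed_early = list(results)   # nothing flushed while input may still arrive
sink.close()                    # caller decides the input is complete
print(flushed_early, results)
```

This keeps the decision about when input ends with the caller, which fits the point above that the parser cannot decide it unilaterally.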
> I probably would have used "yield from" if I didn't care about Python 2.

I admit my generator/coroutine lore is largely still in the Python 2 space, so maybe there is something new I can try. In browsing your code, it doesn't really seem to be the same circumstances, but I can investigate further.

> I'm fine with using generators but the code isn't making it clear how to. I can understand since it is at an early stage.
I'd think you'd want to get that info from docs rather than the code. Parsing is pretty complex stuff, especially when you're dealing with something like XML, so the code is going to be overwhelmingly concerned with that part of things. Did my example in this thread help? The idea is to add it to the examples dir, and also use it in docs.