[ANN] pex, a powerful PEG parsing library

584 views
Skip to first unread message

Ghadi Shayban

unread,
Nov 17, 2015, 1:06:30 AM11/17/15
to Clojure
Here is a compliant JSON parser in 99 LOC, implemented with pex, a new parsing library. [1]

This is alpha software.

Hot off the heels of Colin Fleming's Conj talk on PEGs [2], I'm making public an early version of pex [3], a parsing library in the PEG family.  For those of you familiar with Lua's LPEG library, this is similar, but with a Clojure-y twist.  Like LPEG, pex uses a virtual machine to parse rules (as opposed to combinators or packrat.)  Unlike Colin's talk, pex operates on traditional character types, not generic sequences/data structures.

Why? Parsing Expression Grammars are simpler than most other grammar types, but more powerful than regular expressions. They do not introduce ambiguity -- you get one valid parse or none.  There exists a nice space in between regexes (yuck) and instaparse (power but ambiguity).

Here is a tiny example grammar to match floating point numbers:

(def Number '{number  [digits (? fractional) (? exponent)]
              fractional ["." digits]
              exponent   ["e" (? (/ "+" "-")) digits]
              digits     [(class num) (* (class num))]})

The only other input this particular grammar needs is to let pex know what a `num` character class is.  (There is an interface that you can implement to match things, and several helpers.  I'm planning to have several common ones out of the box.)  Well, you also need to tell the grammar compiler what rule to start with (number).

The grammar format has user defined macros which let you hide a lot of boilerplate, or make higher order rules.  For example, it's very common to chew whitespace after rules, so hiding that is useful.  There are also captures and actions that operate on a virtual "Value Stack".  For example, while parsing a JSON array, you push all the captured values from the array onto the stack, then reduce them into a vector with an action.

It's very early, but pex's completely unoptimized engine can parse a 1.5MB file in ~58ms vs ~9ms for Cheshire/Jackson, which is a handwritten highly-tuned parser with many thousands of lines of code behind it.  I plan on closing that gap by a) implementing some of LPEG's compiler optimizations and b) improving some of the terribly naive impls in the parser. The win here is high expressive power per unit of performance, not raw performance... 

Internally, the grammar data structure is analyzed, compiled into special parsing bytecode, and then subsequently run inside a virtual machine [4].

Hope you can find this useful in your data munging endeavors.  Next up is to make CSV & EDN example parsers, tune the performance, make grammar debugging better, and write more docs & tests.  I encourage any feedback.


bernardH

unread,
Nov 19, 2015, 8:24:37 AM11/19/15
to Clojure
This is interesting !
It reminds me of Parsnip from C.Grand [0], have you considered it when desining pex ? As your parser is focusing of characters, I am wondering : could the operations triggered by the execution of your pex code be simple enough to warrant actual compiling to JVM bytecode (at run time, with ASM [1]) for maximum performance ?

Best Regards,

Bernard

[0] https://github.com/cgrand/parsnip/
[1] http://asm.ow2.org/

Ghadi Shayban

unread,
Nov 19, 2015, 8:43:32 AM11/19/15
to Clojure
Thanks for taking a look.

User-level bytecode allows me an easier choice to build a JIT or tracing infrastructure, while being far less complex than writing out JVM bytecode during grammar compile.

Christophe has certainly been a help offline with design choices.  I wanted PEG, no ambiguity, unlike instaparse or parsnip.  Most of the API inspiration was from LPEG & Scala's Parboiled2.  Some of the VM internals are close to JRuby's Joni regex engine.

Colin Fleming

unread,
Nov 19, 2015, 3:58:47 PM11/19/15
to clo...@googlegroups.com
I'm really impressed by how fast it is out of the box with basically no optimisations. Tatu Saloranta is fanatical about Jackson performance, getting to within 6x on the first attempt is very promising.

--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ghadi Shayban

unread,
Nov 20, 2015, 8:16:03 AM11/20/15
to Clojure
The performance baseline was a pleasant surprise, but does get me thinking: How complex is Jackson?  It's thousands of lines of code.  I know I just wrote this library to help avoid writing recursive descent parsers manually, but maybe for JSON specifically an approach like Go's scanner [1] is worth pursuing in Java.

As for Pex itself, lots of areas of potential optimization:
1) Better char matchers -- like int-mask based ones.  Initial profiling suggested this.
2) Perhaps aggressive inlining of rules
3) Find out what Roberto Ierusalimschy did when he built a JIT for LPEG.  He said it improved perf by 3x, but it's not in mainline.
4) Implement the TestChar and TestCharset bytecodes from the LPEG paper -- I have the AST to be able to do it.  (It helps prevent call stack churn when you immediately jump into a rule and then fail because the first char doesn't match.)  There is a valuable match capture optimization to perform too.
5) Perhaps tear apart the StackEntry object into int[].  VM debugging was such a PITA, I would have never finished it if I did that at the outset.
Reply all
Reply to author
Forward
0 new messages