[ANN] Instaparse 1.0.0

1,339 views
Skip to first unread message

Mark Engelberg

unread,
Apr 9, 2013, 1:18:39 AM4/9/13
to clojure
Instaparse is an easy-to-use, feature-rich parser generator for Clojure.  The two stand-out features:

1. Converts standard EBNF notation for context-free grammars into an executable parser.  Makes the task of building parsers as lightweight and simple as working with regular expressions.

2. Works with *any* context-free grammar.  This means you don't have to learn the esoteric subtleties of LR, LL, LALR or any other specialized subset.  Left-recursion, right-recursion, ambiguous grammars -- instaparse handles it all.

Example:

(def as-and-bs
  (parser
    "S = AB*
     AB = A B
     A = 'a'+
     B = 'b'+"))

=> (as-and-bs "aaaaabbbaaaabb")
[:S
 [:AB [:A "a" "a" "a" "a" "a"] [:B "b" "b" "b"]]
 [:AB [:A "a" "a" "a" "a"] [:B "b" "b"]]]

https://github.com/Engelberg/instaparse for full feature list and extensive tutorial.

Laurent PETIT

unread,
Apr 9, 2013, 2:35:18 AM4/9/13
to clo...@googlegroups.com
Man, this looks like pure gold!
--
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Tassilo Horn

unread,
Apr 9, 2013, 4:33:54 AM4/9/13
to Mark Engelberg, clojure
Mark Engelberg <mark.en...@gmail.com> writes:

Hi Mark,

> Example:
>
> (def as-and-bs
> (parser
> "S = AB*
> AB = A B
> A = 'a'+
> B = 'b'+"))

Nice, but providing the grammar as a plain string looks somewhat
unnatural to me. Why not something like this (parser being a macro)?

(def as-and-bs
(parser
S = AB* .
AB = A B .
A = "a" + .
B = "b" + .))

I.e., symbols denote non-terminals, strings denote terminals, and the
dot indicates the end of a rule.

Bye,
Tassilo

Mark Engelberg

unread,
Apr 9, 2013, 5:41:38 AM4/9/13
to Mark Engelberg, clojure
On Tue, Apr 9, 2013 at 1:33 AM, Tassilo Horn <ts...@gnu.org> wrote:
Nice, but providing the grammar as a plain string looks somewhat
unnatural to me.  Why not something like this (parser being a macro)?

(def as-and-bs
  (parser
    S  = AB* .
    AB = A B .
    A  = "a" + .
    B  = "b" + .))

I.e., symbols denote non-terminals, strings denote terminals, and the
dot indicates the end of a rule.

Bye,
Tassilo

I played around with that, but even if you suppress evaluation by using a macro, Clojure's reader makes strong assumptions about certain symbols.  For example, it is standard in EBNF notation for {} to mean zero-or-more.  But if you include {A B C} in your grammar using the macro approach, Clojure's reader will throw an error because it treats {} as a map and expects an even number of forms to follow.  That was the main reason, but it also makes the notation much more sensitive to whitespace (for example, AB * versus AB*).  Gradually, those little issues start making it look less and less like traditional notation.  There's something really nice about just being able to copy and paste a grammar off of a website and have it just work.

I understand where you're coming from, though.  It definitely is part of the Clojure culture to avoid string representations for many kinds of data (e.g., SQL queries).  We do accept it for regular expressions, and for things like #inst "2011-12-31T19:00:00.000-05:00", though, and that's the kind of feel I was going for.  Would it be more psychologically palatable to type:
#insta/parser "S = 'a' 'b'"
rather than
(insta/parser "S = 'a' 'b'")
?

What do you think would be gained by making it a macro?  From my perspective, a macro is essentially just a string that is being processed by the Clojure reader (and thus subject to its constraints).  If the grammar were expressed in the way you propose, is it any easier to build up a grammar programmatically?  Is it any easier to compose grammars?  If anything, I think it might be harder.

Thanks for the comments,

Mark

Philipp Meier

unread,
Apr 9, 2013, 5:56:13 AM4/9/13
to clo...@googlegroups.com, Mark Engelberg


Am Dienstag, 9. April 2013 11:41:38 UTC+2 schrieb puzzler:
On Tue, Apr 9, 2013 at 1:33 AM, Tassilo Horn <ts...@gnu.org> wrote:
Nice, but providing the grammar as a plain string looks somewhat
unnatural to me.  Why not something like this (parser being a macro)?

(def as-and-bs
  (parser
    S  = AB* .
    AB = A B .
    A  = "a" + .
    B  = "b" + .))


I played around with that, but even if you suppress evaluation by using a macro, Clojure's reader makes strong assumptions about certain symbols.  For example, it is standard in EBNF notation for {} to mean zero-or-more.  But if you include {A B C} in your grammar using the macro approach, Clojure's reader will throw an error because it treats {} as a map and expects an even number of forms to follow.  That was the main reason, but it also makes the notation much more sensitive to whitespace (for example, AB * versus AB*).  Gradually, those little issues start making it look less and less like traditional notation.  There's something really nice about just being able to copy and paste a grammar off of a website and have it just work.


What about something like this, a little like enlive does encode selectors?

 (insta/parser :S [:AB :*] :AB [:A :B] :A ["a" :*] :B ["b" :*])

-billy.

Mark Engelberg

unread,
Apr 9, 2013, 5:56:26 AM4/9/13
to Mark Engelberg, clojure
On Tue, Apr 9, 2013 at 2:41 AM, Mark Engelberg <mark.en...@gmail.com> wrote:
What do you think would be gained by making it a macro?  From my perspective, a macro is essentially just a string that is being processed by the Clojure reader (and thus subject to its constraints).  If the grammar were expressed in the way you propose, is it any easier to build up a grammar programmatically?  Is it any easier to compose grammars?  If anything, I think it might be harder.


I should also point out, for anyone who hasn't had a chance to read through the tutorial, that instaparse does include a combinator interface that lets you build the parsers entirely through functions.   The way I look at it is that if you're going for human writing and readability, it's hard to beat strings (or reading from a resource file), and if you're going for a grammar that will be generated by a program, it's hard to beat functions.  So both are included.  But targeting some sort of functional-ish looking thing that doesn't look quite right and doesn't compose effectively -- that doesn't seem to serve either purpose well.

David Powell

unread,
Apr 9, 2013, 6:29:45 AM4/9/13
to clo...@googlegroups.com

Looks awesome.

Would it be possible to plug in support for the ABNF[1] notation that the IETF use?  Might be useful for implementing standards.  Mostly just a different syntax for repetition, and has support for comments.


-- 
Dave

Softaddicts

unread,
Apr 9, 2013, 6:39:54 AM4/9/13
to clo...@googlegroups.com
I am not convinced that typing more delimiter characters is a better thing.
The string representation is ... shorter and less error prone.

As Mark said, a string isolates you from the reader which could change
its behavior over time introducing other undesirable side effects like {}
with an uneven # of args.

Thank you for this parser Mark, it's just in time, I have a need for this that
came to life a week ago :) Will definitively try this in the next month or so.

Luc P.
> --
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clo...@googlegroups.com
> Note that posts from new members are moderated - please be patient with your first post.
> To unsubscribe from this group, send email to
> clojure+u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>
>
--
Softaddicts<lprefo...@softaddicts.ca> sent by ibisMail from my ipad!

Mark Engelberg

unread,
Apr 9, 2013, 6:57:18 AM4/9/13
to clo...@googlegroups.com
Thanks for the suggestion.  I based the syntax off of EBNF, and hadn't run across ABNF notation before your link just now.  It shouldn't be too hard to add support for ABNF's repetition syntax and comments.  Getting the semantics of the terminal values to precisely match the ABNF spec seems like more effort -- how essential is that piece do you think, versus just using strings and regexes for the terminals? 

Support for comments in the EBNF notation is probably a good idea too.

Laurent PETIT

unread,
Apr 9, 2013, 7:47:53 AM4/9/13
to clo...@googlegroups.com
I find this library very exciting. It is incredibly well documented,
already covers a broad range of use cases, I can't wait for trying it.

There are also a lot of interesting ideas that may pollinate outside
it, I love the cross-pollination capability of open source software !
(i.e. I'd love to see some of these features into parsley).

Do you have a roadmap for the next releases ?

Of interest to me:
- restartable version ? (probably doable, if internal implementation
uses immutable datastructures)
- incremental version ? (Dunno if that's an easy one, and way beyond
my current programming skills anyway to be able to even judge)
- possibility to have transform-map(s) passed as an argument to parse,
so that a collection of views can be applied in parallel and the tree
be "traversed" once ?

Have you tried some tests to compare performance with e.g. same
grammars in Antlr ?

Thanks again for the hard work,

--
Laurent

2013/4/9 Mark Engelberg <mark.en...@gmail.com>:

Michael Klishin

unread,
Apr 9, 2013, 9:32:40 AM4/9/13
to clo...@googlegroups.com

2013/4/9 Laurent PETIT <lauren...@gmail.com>

I find this library very exciting. It is incredibly well documented,
already covers a broad range of use cases, I can't wait for trying it.

I concur. Thank you Mark for not letting your users down and writing

Tassilo Horn

unread,
Apr 9, 2013, 10:14:43 AM4/9/13
to Mark Engelberg, clojure
Mark Engelberg <mark.en...@gmail.com> writes:

Hi Mark,

>> (def as-and-bs
>> (parser
>> S = AB* .
>> AB = A B .
>> A = "a" + .
>> B = "b" + .))
>
> I played around with that, but even if you suppress evaluation by
> using a macro, Clojure's reader makes strong assumptions about certain
> symbols. For example, it is standard in EBNF notation for {} to mean
> zero-or-more. But if you include {A B C} in your grammar using the
> macro approach, Clojure's reader will throw an error because it treats
> {} as a map and expects an even number of forms to follow.

Indeed, I didn't think about this one.

> That was the main reason, but it also makes the notation much more
> sensitive to whitespace (for example, AB * versus AB*). Gradually,
> those little issues start making it look less and less like
> traditional notation. There's something really nice about just being
> able to copy and paste a grammar off of a website and have it just
> work.

That's nice.

> I understand where you're coming from, though. It definitely is part of
> the Clojure culture to avoid string representations for many kinds of data
> (e.g., SQL queries). We do accept it for regular expressions, and for
> things like #inst "2011-12-31T19:00:00.000-05:00", though, and that's the
> kind of feel I was going for. Would it be more psychologically palatable
> to type:
> #insta/parser "S = 'a' 'b'"
> rather than
> (insta/parser "S = 'a' 'b'")
> ?

No, not really.

> What do you think would be gained by making it a macro?

I don't really care if it's a macro or a function, but if the grammar
was represented using symbols, keywords, and clojure collections, you'd
get at least a bit syntax highlighting and maybe a bit support for
indentation or refactoring.

> From my perspective, a macro is essentially just a string that is
> being processed by the Clojure reader (and thus subject to its
> constraints). If the grammar were expressed in the way you propose,
> is it any easier to build up a grammar programmatically? Is it any
> easier to compose grammars? If anything, I think it might be harder.

I think it would be nice if grammars were expressed in a format
processable by Clojure in the very same respect as why homoiconicity in
general is a strength of Lisps.

FWIW, the Bison equivalent Emacs Lisp parser Wisent uses grammars
represented as s-exps.

http://www.gnu.org/software/emacs/manual/html_node/wisent/Grammar-format.html#Grammar-format

IMHO, it would be cool if a clojure parser library would use a similar
format (exploiting clojure data structures beyond lists where they make
sense), at least internally. Of course, one could still have other
frontends like for the EBNF you are using which would be translated to
the internal format.

Bye,
Tassilo

Rich Hickey

unread,
Apr 9, 2013, 10:36:09 AM4/9/13
to clo...@googlegroups.com
That looks stunning - congrats and thanks!

Mark Engelberg

unread,
Apr 9, 2013, 2:32:53 PM4/9/13
to clo...@googlegroups.com
Wow, the enthusiastic response has been really gratifying.  Thanks for the kind words;  I hope as people try it out that it lives up to the hype.

On Tue, Apr 9, 2013 at 4:47 AM, Laurent PETIT <lauren...@gmail.com> wrote:
Do you have a roadmap for the next releases ?

Initially, my focus will be on gathering feedback from users: making sure it's as stable as I think it is, gathering feature requests.

Beyond that, my next objective is to produce a version that can leverage multiple threads.  I'm not sure whether the parser really needs to be any faster than it currently is, but Clojure makes it so easy, I might as well offer that.


Of interest to me:
- restartable version ? (probably doable, if internal implementation
uses immutable datastructures)
- incremental version ? (Dunno if that's an easy one, and way beyond
my current programming skills anyway to be able to even judge)

I don't know offhand how doable these are either.  I'd need to investigate further.  It wasn't really a goal of mine, since Parsley already addresses that space.
 
- possibility to have transform-map(s) passed as an argument to parse,
so that a collection of views can be applied in parallel and the tree
be "traversed" once ?

Actually, I've already done something very similar to this.  You can find it in the "transform" branch of the github page, and you're welcome to try it out and let me know what you think.  The tutorial has been updated on that branch to explain how to use that feature, but it basically works the way you'd expect -- you simply pass the transform map as an argument to *parser* (the parser-building function, not the parse function) using the keyword :transform.  When you do this, the transforms are done during parse time so the tree is traversed only once.  The tradeoff is that parsers with "baked-in transforms" can't be serialized with print-dup.

I like the feature, but hesitated to include it.  I timed parsers with baked-in transforms against those where the transforms were applied after the fact.  On the examples I tried, the time difference was minuscule, presumably because the version with the baked-in transforms ends up applying the transforms to various partial parses along the way.  Some of these partial parses are eventually discarded, so applying transforms to those unsuccessful parses means you're doing extra work that potentially cancels out much of the savings from not having to traverse the tree a second time.

What this means is that if I include this feature, I have to make sure users clearly understand that it is a tradeoff (possibly better performance, possibly not, depending on context; affects serialization).  I like the idea of adding that level of control as to whether the transform is done at parse time or afterwards but, based on my limited experimentation, the benefits didn't seem to outweigh the degree to which it adds to the learning curve.

I decided, for now, to make it available as a branch for people to play around with it, and see if it offers any compelling advantages that tip the balance in favor of including it.

Your suggestion to make it a keyword argument to the parse function is also possible; I mainly included it in the parser-building function because then the attachment of the transform functions to the parser only needs to be done once, giving it a (tiny) performance boost versus doing it every time in the parse function.


Have you tried some tests to compare performance with e.g. same
grammars in Antlr ?

No, I haven't done back-to-back performance tests.  I'm guessing that Antlr would be hard to compete with.  Before I began work, I played around with other Clojure parsers to get a feel for their performance, and I believe instaparse to be competitive with those.  That statement is based more on feel than precise benchmarks, though.  I focused my performance tuning on improving instaparse relative to itself, trying to get parse time to scale linearly with the size of the input, for as many grammars as possible.

Mark Engelberg

unread,
Apr 9, 2013, 3:00:16 PM4/9/13
to Mark Engelberg, clojure
On Tue, Apr 9, 2013 at 7:14 AM, Tassilo Horn <ts...@gnu.org> wrote:
IMHO, it would be cool if a clojure parser library would use a similar
format (exploiting clojure data structures beyond lists where they make
sense), at least internally.  Of course, one could still have other
frontends like for the EBNF you are using which would be translated to
the internal format.

There already are several clojure parser libraries that rely more heavily on Clojure data structure syntax to define grammars; in fact, that's probably the norm.  Instaparse was definitely intended as an alternative to that approach.

That said, instaparse definitely uses Clojure data structures internally to represent the parser and provides constructor functions to build them.  It would be pretty easy to build other front-ends that translate to its internal format, including one that tries to strike a balance between readability and homoiconicity.

Mark Engelberg

unread,
Apr 10, 2013, 2:33:48 PM4/10/13
to clo...@googlegroups.com
On Wed, Apr 10, 2013 at 12:22 AM, sesm <sergey.s...@gmail.com> wrote:
Hi Mark,

Amazing stuff, I didn't know, that such general parsing techniques even exist!

One minor comment: it would be nice to add direct links to GLL papers and https://github.com/epsil/gll github repo to save people some googling time.

Links added.  Thanks!

Stathis Sideris

unread,
Apr 11, 2013, 10:08:29 AM4/11/13
to clo...@googlegroups.com
Thanks, this looks fantastic! Is there any way to include comments in the syntax at all?

Thanks,

Stathis

Steven Degutis

unread,
Apr 11, 2013, 10:16:29 AM4/11/13
to clo...@googlegroups.com
I'm not the OP, but if you mean comments within the grammar, sure as long as "comment" is well defined, and if you mean comments *about* the grammar within the parser code, how about creating the grammar using (str) with one string per line, and putting your comments after each line's string?

-Steven


--

Stathis Sideris

unread,
Apr 11, 2013, 10:21:24 AM4/11/13
to clo...@googlegroups.com
Sorry for the lack of clarity -- I meant comments about the grammar within the resource file that describes it. The (str) approach would be good for embedded grammars in Clojure code, but I'm working with a fairly bulky grammar which would be better to keep in its own file.

Stathis

Mark Engelberg

unread,
Apr 11, 2013, 1:12:34 PM4/11/13
to clo...@googlegroups.com
On Thu, Apr 11, 2013 at 7:08 AM, Stathis Sideris <sid...@gmail.com> wrote:
Thanks, this looks fantastic! Is there any way to include comments in the syntax at all?

Last night I started a v1.1 branch to start work on these feature requests.  The first feature I added was, in fact, support for EBNF comment syntax:
(* this is a comment *)

Supports nested comments too, in case you want to wrap a comment around a comment.

Add [instaparse "1.1.0-SNAPSHOT"] to your dependencies and give it a try.

As I add features, I'll add them to the v1.1 change log:
https://github.com/Engelberg/instaparse/blob/v1.1/CHANGES.md
and also update the README on the v1.1 branch:
https://github.com/Engelberg/instaparse/blob/v1.1/README.md

--Mark

Dmitry Kakurin

unread,
Apr 11, 2013, 10:41:51 PM4/11/13
to clo...@googlegroups.com
Hi Mark,
Brilliant work, thank you!

I was playing with your arithmetic parser from tutorial, and noticed the following definition for add and sub:
    "expr = add-sub
     <add-sub> = mul-div | add | sub
     add = add-sub <'+'> mul-div
     sub = add-sub <'-'> mul-div
...
And I realize now that the ordering "add-sub <'+'> mul-div" here is very deliberate.
What should be my intuition for your parser for default ambiguity resolution rules? (sorry, I'm not familiar [yet] with GLL parsers)
Is it guaranteed to be non-greedy, i.e. it consumes as little as possible?
The docs section "No Grammar Left Behind" is somewhat fuzzy about it.

If I were writing LALR parser I would do (along with defining precedence and associativity rules for op) something like:
     expr = expr op expr | ...
And for LL parser I would do:
     add = mul-div <'+'> add-sub
(i.e. reverse ordering compared to yours).

But your parser rules are somewhat new to me.
Both variations are accepted:
     add = add-sub <'+'> add-sub
     add = mul-div <'+'> add-sub
And in both cases some generated parsers are correct (arithmetically speaking :-) ), but I'd like to understand rules for the first/default parser.
Could you clarify it a little please?

Great docs, BTW.

Thanks, Dmitry.

Mark Engelberg

unread,
Apr 12, 2013, 3:51:36 AM4/12/13
to clojure
On Thu, Apr 11, 2013 at 7:41 PM, Dmitry Kakurin <dmitry....@gmail.com> wrote:
But your parser rules are somewhat new to me.
Both variations are accepted:
     add = add-sub <'+'> add-sub
     add = mul-div <'+'> add-sub
And in both cases some generated parsers are correct (arithmetically speaking :-) ), but I'd like to understand rules for the first/default parser.
Could you clarify it a little please?

Let's start with a simpler example, a grammar that only involves + and numbers (no handling for parens or whitespace).

(def addition
  (insta/parser
    "plus = plus <'+'> plus | num
     num = #'[0-9]'+"))

If you call (parses addition "1+2+3"), you'll see that that two parse results are returned.  This grammar is ambiguous and you'll get one parse result that corresponds to (1+2)+3 and another parse result that corresponds to 1+(2+3).

I gather from your questions that you really want to know which of these parse results will come first (i.e., which one will be returned if you only ask for one parse result).  The answer is:  I make no guarantee as to which will come first.  (I'm not just being ornery; this lack of guarantee is important for my future plans for a concurrent version).

Now in this particular case, since addition is associative, maybe we're perfectly happy to let the ambiguity stand and let instaparse just pick one of the two possible results for us.  But if we don't want that ambiguity, we must alter the grammar until we see that parses always returns one result.  In the above grammar, there are a couple straightforward ways to resolve the disambiguation:

(def addition-associate-right
  (insta/parser
    "plus = num <'+'> plus | num
     num = #'[0-9]'+"))

(def addition-associate-left 
  (insta/parser
    "plus = plus <'+'> num | num
     num = #'[0-9]'+"))

You mentioned that you were familiar with LL grammars.  So this is similar to LL in the sense that if you want a precise result, you need to make the grammar precise -- you can't rely on external precedence or associativity rules.  However, the resulting disambiguated grammars are much more natural.

Consider the above addition-associate-left grammar -- LL would choke on this because it is left-recursive.  If you want to associate to the left in LL, you have to transform the natural-looking left-recursive rule into something much messier.  This is an unfortunate limitation of LL grammars.  For correct calculation, arithmetic operators of equal precedence really need to associate to the left (i.e., 1-2-3 should evaluate as (1-2)-3, not 1-(2-3)), and left recursion is the easiest way to express that.

To better understand this point, take a look at:
http://www.csse.monash.edu.au/~lloyd/tildeProgLang/Grammar/Top-Down/

First, it shows the arithmetic grammar you'd *want* to write.  You'll see it looks almost identical to the grammar I used in the tutorial (although I added a root tag and broke out the operators individually for evaluation purposes).

Then, it goes on to show that in LL, you can't actually write the grammar the way you'd want to write it.  It demonstrates two transformations that eventually turn the grammar into something LL can handle.

The beauty of the tutorial example is that it shows you can get a completely unambiguous grammar, just by writing it down in the natural way that expresses the precedence and associativity constraints.  Unlike LL, the elegant way just works.

So, to summarize the main points here:

* As you develop your grammar, test your grammar with the parses function to determine whether your grammar is ambiguous.
* If your grammar is ambiguous, but the ambiguity doesn't cause a problem, then that's fine, just leave it ambiguous.
* If you need to disambiguate, you must do it by building more precision into the grammar.
* Don't need to worry about limitations of LL, but the mindset of top-down parsing is similar.

I hope this helps!

--Mark

Stathis Sideris

unread,
Apr 12, 2013, 9:15:50 AM4/12/13
to clo...@googlegroups.com
Thank you very much, I'll give it a try!

Dmitry Kakurin

unread,
Apr 12, 2013, 4:55:55 PM4/12/13
to clo...@googlegroups.com
Thanks Mark,
Here is my understanding and few more questions (please correct me if I'm missing something).

    plus = plus <'+'> num | num
is unambiguous left-associative. It is valid and is guaranteed not to cause infinite recursion.

    plus = num <'+'> plus | num
is unambiguous right-associative. Also valid and is guaranteed not to cause infinite recursion.

    plus = plus <'+'> plus | num
is ambiguous. It's allowed but the default parser choice is unpredictable by design.
Is it even consistent across runs? (I'm guessing yes for now)
And across future instaparse versions? (I'm guessing no)

Is there a way to check if my parser is ambiguous without feeding it some input and then using insta/parses fn?
Basically after calling insta/parser I'd like to learn if grammar is OK.

Here is where my question is coming from:
If I were to use such parser in production I'd like it to be unambiguous.
And I'd like to detect ambiguity early, before my software ships/deployed. Preferably during build/packaging/deployment time.
But since for Clojure projects all these things are somewhat fuzzy, at very least I'd like to detect ambiguity during my app startup.
I.e. I'd like to put a big fat assert during initialization phase.

Is there a way to do it now (or planned in the future)?

Thanks for being patient with me :-), Dmitry.

Brandon Bloom

unread,
Apr 12, 2013, 8:55:52 PM4/12/13
to clo...@googlegroups.com
Super cool! Nice work.

Your readme says "I had difficulty getting his Parsing with Derivatives technique to work in a performant way".  I was wondering if you could please elaborate.

What kind of performance did you achieve?
How does that compare to the GLL parser you implemented?
Did you implement memoization/compaction/fixed-point/etc from the latest research? 
How do the implementations compare in terms of code size and readability?

Thanks,
Brandon

Mark Engelberg

unread,
Apr 12, 2013, 9:10:59 PM4/12/13
to clo...@googlegroups.com
On Fri, Apr 12, 2013 at 1:55 PM, Dmitry Kakurin <dmitry....@gmail.com> wrote:
Here is where my question is coming from:
If I were to use such parser in production I'd like it to be unambiguous.
And I'd like to detect ambiguity early, before my software ships/deployed. Preferably during build/packaging/deployment time.
But since for Clojure projects all these things are somewhat fuzzy, at very least I'd like to detect ambiguity during my app startup.
I.e. I'd like to put a big fat assert during initialization phase.

Is there a way to do it now (or planned in the future)?


From the Wikipedia article on ambiguous grammars:
"The general decision problem of whether a grammar is ambiguous is undecidable..."

So it is impossible to make a tool that can tell you whether a parser is ambiguous.  I believe there are ways to detect certain classes of ambiguity, or warn of potential problems.  I have no idea how well those techniques work and how hard they are to implement.  I'll add that to my list of things to research someday.  But for now, ambiguity has to be tested with test cases, just as you'd do for other aspects of your code.

--Mark

P.S.  All the other things you stated, checking for understanding, were correct.

Cedric Greevey

unread,
Apr 12, 2013, 11:19:29 PM4/12/13
to clo...@googlegroups.com
Is there not a way to have your parallel parsing cake and eat a deterministic ordering too? Namely, generate the (undeterministic) list of possible parses and then parallel mergesort it in some way that imposes a deterministic total order on distinct parses. Lexicographic would suffice, I'd expect.

Or maybe even just build the output up in a sorted-set held in an atom, for that matter. The sorted inserts will be distributed among the various threads, and retries will be rare if the parse generating time dominates the sorted-insert time by an order or two of magnitude.

Mark Engelberg

unread,
Apr 13, 2013, 1:48:03 AM4/13/13
to clo...@googlegroups.com
On Fri, Apr 12, 2013 at 5:55 PM, Brandon Bloom <brandon...@gmail.com> wrote:
Your readme says "I had difficulty getting his Parsing with Derivatives technique to work in a performant way".  I was wondering if you could please elaborate.

What kind of performance did you achieve?
How does that compare to the GLL parser you implemented?
Did you implement memoization/compaction/fixed-point/etc from the latest research? 
How do the implementations compare in terms of code size and readability?



"Parsing with Derivatives" is a very interesting algorithm: each time it processes a character, it creates a new grammar for the remainder of the text to be parsed.  The grammar is then compacted.  That process of creating a new grammar and then compacting it each time you process a character proved to be rather memory intensive, and I felt the time lag was pretty noticeable, even with only a few thousand characters.

I did try to follow the latest research.  I had quite a bit of difficulty getting the fixed point process to work right.  The Racket sample code used a form of dynamic variable that they call "parameter" that doesn't map exactly to what Clojure allows you to do with vars.  I tried a few approaches, but none quite worked the way I expected.  I never got this implemented satisfactorily.

Another complication is that in Racket, it's easier to create self-referential and mutually recursive data structures than in Clojure, and the reference implementation relied pretty heavily on this aspect of Racket.  I was playing around with kotarak/lazymap as an attempt to solve the problem, and I did have some success with that.

But perhaps the biggest problem occurred when I tried to work on the version of the parser that returns a parse tree (as opposed to just "recognizing" the input as satisfying the grammar).  The way that process works is that with each transformation of the parser, you build up a huge continuation of what you plan to create when the end is reached.  In Racket, the stack is effectively unlimited so this is no problem.  But in Clojure, when you reach the end and try to execute that continuation -- boom, stack overflow.  The length of the input was limited by the stack; that's no good.

So I tried to build up the continuation in a way that could be executed in a trampoline, but with the laziness and multiple prongs of execution, it got ugly real fast.  Never really got the stack overflow resolved completely, which made it difficult to test the Clojure version on realistic-sized inputs.

I'm not saying these problems are insurmountable, I'm just saying that I really struggled with them.  I was getting extremely frustrated and close to giving up when I found out about the GLL algorithm. 

The Derivatives code was definitely shorter and more readable, but I felt it was a lot easier to reason about the GLL algorithm and figure out how to tune it for Clojure and add features to it.  I was able to quickly get it to the point where I could handle large inputs without stack overflow and, on extremely simple grammars, could parse 200,000 characters in a couple seconds on my crappy laptop.  I knew that was fast enough to be useful for a lot of tasks and a night and day difference from what I had been achieving after investing far more effort on the Derivatives code.  So I dropped the Derivatives implementation and didn't look back.

David Powell

unread,
Apr 13, 2013, 5:45:16 AM4/13/13
to clo...@googlegroups.com

Would it be possible to have an option to tag the parse tree with meta-data indicating the character offset (or line & column)?

-- 
Dave
 

Mark Engelberg

unread,
Apr 13, 2013, 6:20:59 AM4/13/13
to clo...@googlegroups.com
Yes, quite possible.  I've added it to the enhancement list on the github issues page.

Brandon Bloom

unread,
Apr 14, 2013, 5:43:44 PM4/14/13
to clo...@googlegroups.com
Thanks for the details. You definitely made the right pragmatic decision. What you've said pretty much matches what I expected to hear, although I'm hopeful that the approach can be refined, since it's quite eloquent. Beyond eloquence, the derivatives approach is also interesting for schema validation. See: http://www.thaiopensource.com/relaxng/derivative.html

Regarding memory pressure, I think the primary pain point originates from the same underlying cause: Those Racket-isms that you found difficult to port to Clojure. Cyclic data structures, in particular, preclude the structural sharing that occurs with incremental changes to persistent data structures. Additionally, memoization is thwarted because cycles complicate equality. Cycles are better off encoded as named references in a spanning tree (and, like an AST, would need nominal equality) and Laziness can be achieved indirectly via fixed-point rewriting (rather than explicit thunks) which is already necessary for the algorithm.

I asked Matt Might about all this, but Twitter isn't really the best forum for that conversation.

Cheers,
Brandon

Dmitry Kakurin

unread,
Apr 14, 2013, 7:14:45 PM4/14/13
to clo...@googlegroups.com
Interesting, I did not know that.
That's OK if checks do not *guarantee* correctness.
But having 20/80 Pareto principle in mind: if few simple detection technics will warn about 80% of ambiguous grammars (especially the ones found in practice), that would be very helpful.

Thanks, Dmitry.

Max Gonzih

unread,
May 2, 2013, 10:19:14 AM5/2/13
to clo...@googlegroups.com, Mark Engelberg
Hi, what do you think about dsl version using map?
Nice Idea was proposed here http://www.reddit.com/r/Clojure/comments/1djbio/growing_a_lanugage_with_clojure_and_instaparse/c9qwv4d

On Tuesday, April 9, 2013 12:41:38 PM UTC+3, puzzler wrote:
On Tue, Apr 9, 2013 at 1:33 AM, Tassilo Horn <ts...@gnu.org> wrote:
Nice, but providing the grammar as a plain string looks somewhat
unnatural to me.  Why not something like this (parser being a macro)?

(def as-and-bs
  (parser

    S  = AB* .
    AB = A B .
    A  = "a" + .
    B  = "b" + .))

I.e., symbols denote non-terminals, strings denote terminals, and the
dot indicates the end of a rule.

Bye,
Tassilo

I played around with that, but even if you suppress evaluation by using a macro, Clojure's reader makes strong assumptions about certain symbols.  For example, it is standard in EBNF notation for {} to mean zero-or-more.  But if you include {A B C} in your grammar using the macro approach, Clojure's reader will throw an error because it treats {} as a map and expects an even number of forms to follow.  That was the main reason, but it also makes the notation much more sensitive to whitespace (for example, AB * versus AB*).  Gradually, those little issues start making it look less and less like traditional notation.  There's something really nice about just being able to copy and paste a grammar off of a website and have it just work.


I understand where you're coming from, though.  It definitely is part of the Clojure culture to avoid string representations for many kinds of data (e.g., SQL queries).  We do accept it for regular expressions, and for things like #inst "2011-12-31T19:00:00.000-05:00", though, and that's the kind of feel I was going for.  Would it be more psychologically palatable to type:
#insta/parser "S = 'a' 'b'"
rather than
(insta/parser "S = 'a' 'b'")
?

What do you think would be gained by making it a macro?  From my perspective, a macro is essentially just a string that is being processed by the Clojure reader (and thus subject to its constraints).  If the grammar were expressed in the way you propose, is it any easier to build up a grammar programmatically?  Is it any easier to compose grammars?  If anything, I think it might be harder.

Thanks for the comments,

Mark

Mark Engelberg

unread,
May 2, 2013, 2:37:01 PM5/2/13
to Max Gonzih, clo...@googlegroups.com
On Thu, May 2, 2013 at 7:19 AM, Max Gonzih <gon...@gmail.com> wrote:
Hi, what do you think about dsl version using map?

The short answer is that this is already supported.  In the github readme, you'll find a section labeled "Combinators", in which I outline instaparse's dsl for creating grammars as a map from keywords to combinators.

Reply all
Reply to author
Forward
0 new messages