[GSoC 2013] Parboiled2: PEG parser via Scala Macros


Alexander Myltsev

Apr 27, 2013, 12:16:20 PM
to scala-l...@googlegroups.com
Good day.

My name is Alexander Myltsev. I am going to apply to GSoC this year with the Parboiled2 project.

You have probably heard of Parboiled [1], a lightweight library providing input text parsing based on PEGs (parsing expression grammars). It is written in Java and has a nice Scala wrapper. You can find projects using Parboiled at [2].

A Parboiled parser is a complex graph of objects that lives on the heap. It is quite easy to compose a parser, but at runtime it essentially “interprets” the parser rule tree, so it is rather slow.

On the other hand, a parser can be implemented with heavy use of mutation and fast arithmetic and logical operations, avoiding object construction and function calls, as at [3]. Such a parser is faster, but it is far from functional style and much harder to write by hand.

Can a good-looking parser definition and fast runtime speed be combined? Yes, by a kind of parser “compilation” from the rules with the help of Scala macros. The prototype is at [4]. You can find an elegant combinator-style parser construction at [5]:

def combination = rule { ('a' || 'b' || 'c') ~ ('d' || 'e') ~ 'f' }

which is expanded into a class that uses mutation, similar to [3].
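
Just to give an idea of the target shape, here is a hand-written sketch of the kind of code such a rule could expand to (only an illustration with made-up names, not the actual generated code):

    // Hand-written sketch, roughly corresponding to
    // ('a' || 'b' || 'c') ~ ('d' || 'e') ~ 'f' -- not the real parboiled2 output.
    class CombinationParser(input: String) {
      private var cursor = 0

      private def matchChar(c: Char): Boolean =
        cursor < input.length && input.charAt(cursor) == c && { cursor += 1; true }

      private def firstOf(alternatives: Char*): Boolean = {
        val mark = cursor
        alternatives.exists { c => cursor = mark; matchChar(c) }
      }

      // proper backtracking of the whole sequence on failure is omitted for brevity
      def combination: Boolean =
        firstOf('a', 'b', 'c') && firstOf('d', 'e') && matchChar('f')
    }

No object construction per parse step and no rule-tree interpretation -- just cursor arithmetic and boolean logic.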

There are three major stages from the current prototype to a first ready-to-use version of the library:
1. Transform the Scala AST into an intermediate rule-operation tree (“OpTree”). An OpTree is easier to operate on than complicated constructs like the one at [6]; a rough sketch follows below.
2. Apply transformations and optimizations to the OpTree, e.g. expand string matchers into sequences of character matchers, so that [7] will no longer be needed.
3. Transform the processed OpTree back into a Scala AST that yields parser classes with higher runtime speed.
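
Roughly, I imagine the OpTree along these lines (an illustrative sketch only; the real representation will likely differ):

    // Illustrative sketch of an intermediate rule-operation tree ("OpTree").
    sealed trait OpTree
    case class CharMatch(c: Char)                 extends OpTree
    case class Sequence(lhs: OpTree, rhs: OpTree) extends OpTree
    case class FirstOf(lhs: OpTree, rhs: OpTree)  extends OpTree

    // Stage 2 example: expand a (non-empty) string matcher into a sequence
    // of character matchers.
    def expandString(s: String): OpTree =
      s.map(c => CharMatch(c): OpTree).reduceRight(Sequence(_, _))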

I would formulate it more carefully in my proposal. Still, the idea should be clear enough.

What do you think of that? Is it good enough for GSoC? Should it be extended somehow? Any ideas are welcome!

Mathias Doenitz [8] -- the original creator of Parboiled -- has kindly agreed to mentor this project.

[1] https://github.com/sirthias/parboiled
[2] https://github.com/sirthias/parboiled/wiki/Projects-using-parboiled
[3] https://github.com/spray/spray/blob/master/spray-http/src/main/scala/spray/http/parser/UriParser.scala
[4] https://github.com/alexander-myltsev/parboiled2
[5] https://github.com/alexander-myltsev/parboiled2/blob/develop/src/test/scala/org/parboiled/ParserSpec.scala#L27
[6] https://github.com/alexander-myltsev/parboiled2/blob/develop/src/main/scala/org/parboiled/Parser.scala#L73
[7] https://github.com/alexander-myltsev/parboiled2/blob/develop/src/main/scala/org/parboiled/Parser.scala#L54
[8] https://github.com/sirthias

Regards,
  Alexander

Eugene Burmako

Apr 27, 2013, 4:03:11 PM
to scala-l...@googlegroups.com
From what I see, you've already done some implementation work. What difficulties did you encounter? Was there any functionality provided by macros that didn't work properly?

All in all, the advertised marriage of macros and high-level parsers sounds amazing! Nevertheless, are there any conceptual problems / limitations associated with the compile-time approach to parsing?

Finally, have you taken a look at LMS? I heard folks over there have also achieved pretty good results accelerating parsing.

Haoyi Li

Apr 29, 2013, 2:35:29 AM
to scala-l...@googlegroups.com
Not entirely related, but it would be cool if you could use macros to avoid the duplication of variable names/ordering when defining a matcher along with a transformer function:
 
pair = string ~ ":" ~ json_exp >> {case (k, _, v) => (k, v) }
 
With unhygienic macros, you could munge with the scoping and get some really nice syntax:
 
pair = (string is k) ~ ":" ~ (json_exp is v) >> (k, v)
 
I've been exploring a similar concept for parser combinators in Python (also using macros!), and it really makes extracting the interesting values out of a large matcher much more convenient.
 


Alexander Myltsev

Apr 29, 2013, 10:46:21 AM
to scala-l...@googlegroups.com
Dear Eugene,


On Sunday, 28 April 2013 00:03:11 UTC+4, Eugene Burmako wrote:
From what I see you've already done some implementation work. What difficulties did you encounter? Any functionality provided by macros which didn't work properly?
Macros work just fine for the moment; I have seen no bugs.

The main difficulty is observing what is under the hood of the compiled code. Personally, I do not (yet) find it convenient to work with the Scala AST produced by -Ymacro-debug-lite. I hoped that the debugger would provide more information, but unfortunately I did not manage to launch it -- it seems that will take more time.

Bound variables during compilation were quite surprising, but I fixed those issues quite easily.
 

All in all, the advertised marriage of macros and high-level parser sounds amazing! Nevertheless, are there any conceptual problems / limitations associated with the compile-time approach to parsing?
An LL(k) parser generator is required to look at the sub-rules. With macros that is hard (if possible at all), since the rule macro only sees the definition of a single method and cannot look inside the rules it references. PEGs, however, rely on backtracking to determine which of the alternatives to take, so there is no need to look ahead, and generating a PEG parser with macros is possible from that point of view.
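
For illustration, here is a tiny hand-written sketch (not parboiled2 code) of why no lookahead is needed: two alternatives that share a prefix are handled by simply trying the first one and resetting the input position when it fails.

    final class MiniPeg(input: String) {
      private var cursor = 0

      private def str(s: String): Boolean = {
        val mark = cursor
        s.forall(c => cursor < input.length && input.charAt(cursor) == c && { cursor += 1; true }) ||
          { cursor = mark; false }   // reset on failure
      }

      // grammar: expr <- "abc" / "abd"  (common prefix "ab", no lookahead required)
      def expr: Boolean = {
        val mark = cursor
        str("abc") || { cursor = mark; str("abd") }
      }
    }

    // new MiniPeg("abd").expr == true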

Do you see more conceptual problems or limitations here? Please share them.
 

Finally, have you taken a look at LMS? I heard folks over there have also achieved pretty good results accelerating parsing.
I know LMS; I have worked with Scalan, which is based on LMS. But I haven't heard about parsing acceleration. Please share the links you know of.

Regards,
  Alexander
 

Alexander Myltsev

Apr 29, 2013, 10:49:24 AM
to scala-l...@googlegroups.com
Dear Haoyi,

that's a nice feature to save some typing. AFAIK the current version of Parboiled supports a similar feature.

Regards,
  Alexander

Eugene Burmako

Apr 29, 2013, 12:02:54 PM
to scala-l...@googlegroups.com, Manohar Jonnalagedda, Nada Amin
/cc Manohar & Nada, could you please elaborate on parsing with LMS?



Johannes Rudolph

Apr 29, 2013, 12:21:25 PM
to scala-l...@googlegroups.com
On Mon, Apr 29, 2013 at 8:35 AM, Haoyi Li <haoy...@gmail.com> wrote:
> With unhygienic macros, you could munge with the scoping and get some really
> nice syntax:
>
> pair = (string is k) ~ ":" ~ (json_exp is v) >> (k, v)

Even without untyped macros the syntax in the current parboiled is already

> string ~ ":" ~ json_exp ~~> ((_, _))

if you have a `case class Pair(a: String, b: JsValue)` it becomes

> string ~ ":" ~ json_exp ~~> Pair

I agree that your syntax would be more explicit about what is
happening. Another approach could be to use a monadic form where
values are extracted with a for-comprehension:

for {
  k <- string
  _ <- ":"
  v <- json_exp
} yield (k, v)

--
Johannes

-----------------------------------------------
Johannes Rudolph
http://virtual-void.net

Haoyi Li

Apr 29, 2013, 3:18:46 PM
to scala-l...@googlegroups.com
I'm most familiar with the built-in Scala combinator library, so don't mind if I ask a few questions: in

> string ~ ":" ~ json_exp ~~> ((_, _))
How is it that I do not need to worry about the return value of the ":"? I would have expected to have to write:
 
> string ~ ":" ~ json_exp ~~> ((a, b, c) => (a, c))
 
In order to ignore the ":".
 
The example I gave didn't really show what I meant (I picked it to be simple) so here's another try. This is something that's meant to parse a JSON array. Bear with me for the pseudo-scala (a mix between parboiled and scala-parser-combinators), since my parser-combinatory skills are rusty:
 
    val array = "[" ~ json_exp ~ rep("," ~ json_exp) ~ opt(space) ~ "]" ~~> {case (_, first, rest_pairs, _, _) => first +: rest_pairs.map(_._2)}
 
The huge converter is required because the parser would naturally return a
 
    (String, JsVal, Seq[(String, JsVal)], String, String)
 
which I need to convert into a
 
    Seq[JsVal]
 
So I need to do a bunch of messy stuff in order to extract the T and Seq[T] from the messy parse tree. Now, if I use name-binding, I am able to convert it to:
 
    val array = "[" ~ (json_exp is first) ~ rep("," ~ (json_exp is rest)) ~ opt(space) ~ "]" ~~> (first +: rest)
 
Where `first` is bound to the JsVal returned by json_exp, and `rest` is bound to a Seq[JsVal] because it's bound to the json_exp inside a rep() call, and the transformer puts the JsVal and Seq[JsVal] together into one big Seq[JsVal]. Doing it this way is much easier than receiving the raw parse tree and trying to destructure it / extract the parts I want via indexing.
 
Does this make any sense? It would need to bind new names within the parser-expression which are available within the scope of the transformer-function, something which AFAIK is impossible without macros.


Johannes Rudolph

Apr 30, 2013, 3:44:59 AM
to scala-l...@googlegroups.com
Hi Haoyi,

On Mon, Apr 29, 2013 at 9:18 PM, Haoyi Li <haoy...@gmail.com> wrote:
>> string ~ ":" ~ json_exp ~~> ((_, _))
> How is it I do not need to worry about the return value of the ":"? I would
> have expected to have to write:
>
>> string ~ ":" ~ json_exp ~~> ((a, b, c) => (a, c))

In parboiled every rule can parse to a String, but only explicitly.
I.e. if not specified otherwise, the parser for a string literal is a
`Rule0`, which doesn't produce any output. You can use the `~>`
combinator to create a `Rule1[X]` from the String it has parsed.
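
From memory it looks roughly like this (along the lines of the parboiled Scala calculator example, so details may be slightly off):

    import org.parboiled.scala._

    class NumberParser extends Parser {
      def Digit  = rule { "0" - "9" }            // a Rule0: matches a digit, pushes nothing
      def Digits = rule { oneOrMore(Digit) }     // still a Rule0
      def Number = rule { Digits ~> (_.toInt) }  // a Rule1[Int]: pushes the matched text as an Int
    }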

> The example I gave didn't really show what I meant (I picked it to be
> simple) so here's another try. This is something that's meant to parse a
> JSON array. Bear with me for the pseudo-scala (a mix between parboiled and
> scala-parser-combinators), since my parser-combinatory skills are rusty:
>
> val array = "[" ~ json_exp ~ rep("," ~ json_exp) ~ opt(space) ~ "]" ~~>
> {case (_, first, rest_pairs, _, _) => first +: rest_pairs.map(_._2)}
>
> The huge converter is required because the parser would naturally return a
>
> (String, JsVal, Seq[(String, JsVal)], String, String)

I understand. As I said above, parboiled chose to parse to Strings only
explicitly, which makes separator strings disappear from the parameter lists.

See https://github.com/spray/spray-json/blob/master/src/main/scala/spray/json/JsonParser.scala
for a real parboiled json parser.

> So I need to do a bunch of messy stuff in order to extract the T and Seq[T]
> from the messy parse tree. Now, if I use name-binding, I am able to convert
> it to:
>
> val array = "[" ~ (json_exp is first) ~ rep("," ~ (json_exp is rest)) ~
> opt(space) ~ "]" ~~> (first +: rest)

I agree, and I think your argument may still hold even with parboiled:
E.g. if the function you pass to `~~>` doesn't fit the order of the
values you have parsed before, the syntax you propose would make things
easier to read. My point was only that you can already improve the
syntax a lot over standard Scala parser combinators without having to
resort to macros.

> Does this make any sense? It would need to bind new names within the
> parser-expression which are available within the scope of the
> transformer-function, something which AFAIK is impossible without macros.

I think it could make sense. Something related was considered to
improve the syntax of sbt setting definitions. As I said, another way
to bind new names without macros is using monadic style (which
sometimes won't look as clean as a possible macro-based custom syntax).

Haoyi Li

Apr 30, 2013, 8:20:03 AM
to scala-l...@googlegroups.com
Looking at what you describe, it seems that ignoring the output of the unwanted leaves makes a lot of sense and goes a long way towards what I want without using macros. That's pretty cool!
 
Good to know I'm not crazy; I just thought it was a cool idea so I wanted to throw it out there, and this seemed like a good time. Sorry for hijacking your thread Alexander!
 
-Haoyi


nafg

Apr 30, 2013, 6:14:58 PM
to scala-l...@googlegroups.com


On Monday, April 29, 2013 2:35:29 AM UTC-4, Haoyi Li wrote:
Not entirely related, but it would be cool if you could use macros to avoid the duplication of variable names/ordering when defining a matcher along with a transformer function:
 
pair = string ~ ":" ~ json_exp >> {case (k, _, v) => (k, v) }
 

Can't you do (string <~ ":") ~ json_exp >> ((_, _)) ?
 