scala-parser: a parboiled2 parser for Scala in 700LOC


Haoyi Li

Nov 29, 2014, 5:38:50 AM
to scala-internals
Hey All,

As some of you may know, I have recently spent a lot of time with Parboiled2.

The result of this ~50 hours of effort is scala-parser, a Parboiled2 parser for the Scala programming language syntax. This parser is still not complete (it's missing XML syntax and unicode escapes) but, apart from the files I've explicitly blacklisted for those two features, scala-parser is able to parse every single .scala file in:
  • scala/scala
  • scala-js/scala-js
  • scalaz/scalaz
  • milessabin/shapeless
  • akka/akka
And it's a total of ~700LOC, for everything, including all the undocumented features of Scala syntax that are nowhere to be found in the lexical specification (e.g. semicolon insertion and its context-sensitivity, string interpolation, keyword/key-operator lookahead requirements, macros, annotation-types-can't-have-symbolic-names, ...). 
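For readers who haven't run into the semicolon-insertion context-sensitivity, here is the classic instance of it (plain Scala, runnable as a script; my own example, not from the parser's test suite):

val a = 1
+ 2        // two statements: `val a = 1`, then the expression `+2`; a == 1

val b = (1
+ 2)       // inside parentheses the newline is ignored, so b == 3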

Adding the missing XML syntax and unicode escapes would be about one day of work, given how much time I've spent on the parser so far (i.e. not much).

Presumably having a complete, reasonable, lightweight, modular parser for the language (i.e. not the one in scalac) would be of use to a bunch of people. scala-parser currently just recognizes the grammar and blows up if it can't parse something, but it's straightforward to make it spit out an AST or whatever else you want (range positions, etc.). I haven't actively benchmarked performance, but it walks and parses all 5 of those projects in 80 seconds, which seems plenty fast.
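If you want to reproduce that corpus run, a minimal sketch; it assumes the parser class is called ScalaSyntax with a top-level CompilationUnit rule (adjust the names to whatever the repo actually uses):

import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

def parseAll(root: String): Unit =
  Files.walk(Paths.get(root)).iterator.asScala
    .filter(_.toString.endsWith(".scala"))
    .foreach { path =>
      val src = new String(Files.readAllBytes(path), "UTF-8")
      new ScalaSyntax(src).CompilationUnit.run().get  // .get blows up on parse failure
    }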

At 700 LOC, this is basically compact enough to be an executable specification. The grammar in the Scala spec may be shorter, but it is also wrong and incomplete (to put it mildly), and, worse, there's no way to ensure it ever matches reality, i.e. it's all lies and wishful thinking. This one actually works! And you can feed in reams and reams of Scala code to verify it matches reality.

The parser is still not complete, but is pretty close. It would be cool if people could try it out and see if it's useful for anything they're working on, or help find/fix edge cases in the parser.

Hopefully this is useful to someone ^_^ 

Odd Möller

Nov 29, 2014, 6:51:16 AM
to scala-i...@googlegroups.com

awesome
/ˈɔːs(ə)m/

adjective

  1. extremely impressive or daunting; inspiring awe.

Eugene Burmako

Nov 29, 2014, 7:24:06 AM
to scala-i...@googlegroups.com
Looks really impressive! Thanks a lot for sharing your work. Seems like next week I really ought to learn more about parboiled2 to see how to do range positions and how to construct custom AST nodes during parsing.


Mathias Doenitz

Nov 29, 2014, 7:55:03 AM
to scala-i...@googlegroups.com
Awesome work, Haoyi!!

I'd like to say again that this is probably the largest (public) parboiled2 project so far, and as such it is bound to stretch the library here and there, especially as pb2 is still quite young.
Haoyi has already identified a few areas that definitely need improving (especially with regard to documentation and error reporting) and I'll try to address these points ASAP.
Even more so, this is an impressive piece of engineering on Haoyi's part!

Please let me know if you think that there are things that I can do from the pb2 side to support this initiative (in addition to prioritizing certain tickets)!

Cheers,
Mathias

Rodrigo Cano

Nov 29, 2014, 10:02:23 AM
to scala-i...@googlegroups.com
This is great! I had started the exact same task some time ago, also using pb2, but never got to finish it (and I also suffered hard through the spec's syntax definition; aside from the parts where it was outdated, it was just TOO unfriendly to LL parsers). I'm curious to see how you solved a particular case with a type pattern in a case clause, which looks like:
case a: b => c => d

What should that yield from the parser's perspective? A value a of type b => c? A value a of type b, returning a function? An incomplete type? But that is just an arbitrary choice; after all, the spec defines nothing regarding this, and what scalac does is in my opinion not ideal at all, even yielding an error where it shouldn't, but I digress.

I'm really happy that this is done :D. Huge thanks!


Som Snytt

Nov 29, 2014, 12:44:13 PM
to scala-internals
Christmas isn't for another month.

As someone pointed out on the ML, besides "0xf" there is "0XF".  I can't wait to learn enough to submit a PR.

I hope to fix "0x" being accepted as the value zero, so that scala-parser needn't worry about bug-compatibility.

It needn't, need it?

Jon Pretty

Nov 29, 2014, 12:49:07 PM
to scala-internals
I've actually just submitted a one-byte PR, namely a carefully-placed `0`.

Anyway, this looks great, Haoyi! I'd love to see someone build a syntax highlighter on top of it... :)

Cheers,
Jon
Jon Pretty | @propensive

Haoyi Li

Nov 29, 2014, 3:04:11 PM
to scala-internals
This is great! I had started the exact same task some time ago, also using pb2, but never got to finish it 

This started from your code base, though by this point it doesn't look much like it any more =P

I'm curious to see how you solved a particular case with a type pattern in a case clause, which looks like: case a: b => c => d

Easy. The spec is wrong; it is parsed as a CompoundType and not a Type. 

"The spec is wrong" is a common theme throughout the writing of this parser. At some point, I wonder why we even bother: if it's not executable, it's bound to fall out of date, especially for something as complex as Scala's grammar. I know other communities like e.g. the Python people generate their parser from EBNF using code-gen, and to begin with their grammar is far simpler.

Simon Ochsenreither

Nov 29, 2014, 4:31:07 PM
to scala-i...@googlegroups.com
Wow, impressive!

Now it's much easier to have a look at how much impact certain changes might have (like my favorite one, making anonymous partial function stuff less painful).

Jon Pretty

Nov 29, 2014, 4:51:25 PM
to scala-internals
Have you got some more detail on this idea, Simon?

Cheers,
Jon

On 29 November 2014 at 21:31, Simon Ochsenreither <simon.och...@gmail.com> wrote:
Wow, impressive!

Now it's much easier to have a look at how much impact certain changes might have (like my favorite one, making anonymous partial function stuff less painful).




--
Jon Pretty | @propensive

Simon Ochsenreither

Nov 29, 2014, 5:27:31 PM
to scala-i...@googlegroups.com

Jon Pretty

Nov 29, 2014, 5:47:53 PM
to scala-internals, type...@googlegroups.com
Cool! I'm sorry I missed that the first time round...

If you have a diff/PR, it would be great to experiment with it in the Typelevel fork...

Cheers,
Jon




--
Jon Pretty | @propensive

Matthew Pocock

Nov 29, 2014, 5:52:31 PM
to scala-i...@googlegroups.com
Respect.




--
Dr Matthew Pocock
Turing ate my hamster LTD

Integrative Bioinformatics Group, School of Computing Science, Newcastle University

skype: matthew.pocock
tel: (0191) 2566550

Martin Egri

Nov 29, 2014, 9:35:34 PM
to scala-i...@googlegroups.com
I just want to chime in and express how cool I think this is. Been following your work on/with scala.js since its inception and it's made me that much more enthusiastic about working with Scala. There's a great deal of potential here :)

Simon Ochsenreither

Nov 29, 2014, 11:08:15 PM
to type...@googlegroups.com, scala-i...@googlegroups.com
Sure, I'll have a look! I'd be happy to experiment with it in Typelevel, it's just that I'm currently extremely busy with other stuff I finally want to get done.

Jon Pretty

Nov 29, 2014, 11:30:11 PM
to Simon Ochsenreither, type...@googlegroups.com, scala-internals
Sure - I know that feeling... it's the middle of the night where you are too, right? ;)

If it makes things easier, I'm happy to work with a crude diff against an older version of Scala, if that's all you've got -- it doesn't need to be a beautifully-crafted PR.

Bedtime, I think.
Jon

On 30 November 2014 at 04:08, Simon Ochsenreither <simon.och...@gmail.com> wrote:
Sure, I'll have a look! I'd be happy to experiment with it in Typelevel, it's just that I'm currently extremely busy with other stuff I finally want to get done.




--
Jon Pretty | @propensive

Rex Kerr

Nov 30, 2014, 8:36:55 AM
to scala-i...@googlegroups.com
This is fantastic!  I've been missing something like this for flexible code generation since approximately when I first started using Scala.  (The idea is that if you edit a generated file, you'd like to at least have some clue about how it differs from the source, so that patches can be either backported or at least suggested.  Trying to do this based on pattern-matching alone has been very fragile to cosmetic changes.)

Can't wait to get enough time to try it out.  Thanks!

  --Rex


Haoyi Li

Nov 30, 2014, 9:01:58 AM
to scala-internals
Another day, another blessing. After a few hours' effort, I have XML support working as of commit a0c39cdb0639a64c3277c8c0b269dd4372892331, at a cost of 120LOC. Since most projects don't use much XML, I threw https://github.com/lift/framework into the mix of parsed projects and made everything pass. 

It might still have some un-discovered issues, but it certainly has fewer than the language specification. Onward the steady march of progress!

Eugene Burmako

Nov 30, 2014, 9:24:12 AM
to <scala-internals@googlegroups.com>
How hard would it be to modify the parser so that it would also support parsing in quasiquote mode?

At the moment, scalac uses an unpleasant hack to handle unquotes. If we see q"$x + $y", we generate a synthetic string that looks something like "dummy$1 + dummy$2", then parse it as normal Scala code, and then traverse the resulting tree, replacing the dummies in various locations with the unquotees. That's really irregular and leads to a number of hard-to-fix issues. Can we easily extend the pb2-based parser to natively account for unquotes?
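For readers unfamiliar with the hack, a minimal sketch of the reconstruction step (the dummy naming is illustrative, not scalac's actual internals):

// q"$x + $y" reaches the macro as parts Seq("", " + ", "") plus the trees
// for x and y; first, rebuild a single parseable string:
def syntheticSource(parts: Seq[String]): String =
  parts.init.zipWithIndex
    .map { case (part, i) => part + s"dummy$$$i" }
    .mkString + parts.last

// syntheticSource(Seq("", " + ", "")) == "dummy$0 + dummy$1"
// Then parse that as ordinary Scala, walk the resulting tree, and swap
// each Ident("dummy$i") for the i-th unquoted tree.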


Haoyi Li

Nov 30, 2014, 9:32:49 AM
to scala-internals
How hard would it be to modify the parser so that it would also support parsing in quasiquote mode?

Grzegorz Kossakowski

Nov 30, 2014, 6:40:29 PM
to scala-internals
Hi Haoyi!

Impressive work! Have you considered contributing fixes to the spec for the most glaring inconsistencies you found?

Also, have you tested error reporting in case of files that have syntax errors?






--
Grzegorz Kossakowski
Scalac hacker at Typesafe
twitter: @gkossakowski

Jason Zaugg

Nov 30, 2014, 6:43:24 PM
to scala-i...@googlegroups.com
On Mon, Dec 1, 2014 at 12:32 AM, Haoyi Li <haoy...@gmail.com> wrote:
How hard would it be to modify the parser so that it would also support parsing in quasiquote mode?

It already does:


Unit test that only passes because we're parsing string interpolations correctly:


Not that hard at all ^_^

We don't know what q"foo" means until we typecheck `q`, I don't see how we can eagerly parse the contents of the strings, unless it is some sort of "optimistic parsing" that can be rolled back later.

-jason

Jason Zaugg

Nov 30, 2014, 6:46:39 PM
to scala-i...@googlegroups.com
On Mon, Dec 1, 2014 at 12:01 AM, Haoyi Li <haoy...@gmail.com> wrote:
Another day, another blessing. After a few hours' effort, I have XML support working as of commit a0c39cdb0639a64c3277c8c0b269dd4372892331, at a cost of 120LOC. Since most projects don't use much XML, I threw https://github.com/lift/framework into the mix of parsed projects and made everything pass.

Another good source of test cases (not just for XML) might be the parser test suite of the IntelliJ Scala plugin.

-jason

Jon Pretty

Nov 30, 2014, 6:54:02 PM
to scala-internals
My experience of the error reporting wasn't great, actually: the line that was printed out was fine, and the error was somewhere completely different (which I had to find by progressive deletion).

Jon
Jon Pretty | @propensive

Haoyi Li

Nov 30, 2014, 7:04:43 PM
to scala-internals
Have you considered contributing fixes to the spec for the most glaring inconsistencies you found?

Nope. As I said, I don't see how the spec as written could ever stay correct as things progress. In fact, I don't even trust myself to edit the spec correctly *the first time* without a test suite to check things for me! I'm not an expert in the Scala lexical spec and, despite writing this parser, have managed to avoid becoming one thanks to the magic of automated testing. 

What I want is a concise, machine-checkable specification of what things should parse. Which is exactly what scala-parser is ^_^.

It's only 27kb of code, which is in fact 2.5kb shorter than the spec's lexical description (this and this put together), which sits at 29.5kb of blurb. Shorter, more correct, more complete, less ambiguous, and machine-checkable. What's the point of giving all that up to put it in some verbose, ambiguous, pseudo-EBNF-English language? Who would it serve? If someone like @paulp wants a reference to start writing a working parser, he's not going to look at the spec, he's gonna look at scala-parser and see what it does, which is in fact what he did.

> Also, have you tested error reporting in case of files that have syntax errors?

There is none. Parboiled's error reporting was worse than non-existent (i.e. actively harmful and malicious) up until a few hours ago, when https://github.com/sirthias/parboiled2/issues/102 landed. I haven't had a chance to go try it out because... I've been sleeping. I'll take a look at it some time in the future. Does partest have good existing neg tests with expected error positions I can use?

We don't know what q"foo" means until we typecheck `q`, I don't see how we can eagerly parse the contents of the strings, unless it is some sort of "optimistic parsing" that can be rolled back later.

That's not the parser's problem. That's the responsibility of whoever's manipulating the AST later. The parser should just care about the $ and ${} and $$ escapes and treat them appropriately. And that's what it does!

I mean, we also don't know what doStuff() means either. It could be a macro and give back arbitrary trees, or wipe your hard drive the 7th time you compile. But that doesn't mean that the parser should perform whole-program side-effect analysis to find out. 


Jason Zaugg

Nov 30, 2014, 7:24:18 PM
to scala-i...@googlegroups.com
On Mon, Dec 1, 2014 at 10:04 AM, Haoyi Li <haoy...@gmail.com> wrote:

We don't know what q"foo" means until we typecheck `q`, I don't see how we can eagerly parse the contents of the strings, unless it is some sort of "optimistic parsing" that can be rolled back later.

That's not the parser's problem. That's the responsibility of whoever's manipulating the AST later. The parser should just care about the $ and ${} and $$ escapes and treat them appropriately. And that's what it does!

I thought Eugene was asking whether this parser could perform domain-specific parsing of the contents of the quasiquote interpolator, rather than generally parsing the string interpolation syntax. But perhaps I misunderstood.

-jason

Haoyi Li

Nov 30, 2014, 7:31:31 PM
to scala-internals
AFAICT he was talking about the $ ${} $$ blocks ^_^

Denys Shabalin

Dec 1, 2014, 11:27:25 AM
to scala-i...@googlegroups.com
What Eugene means, in fact, relates to how quasiquotes are desugared in Scala:

1. First the source code gets parsed by the regular parser, which treats q"" interpolators as regular string interpolators and desugars them into StringContext(...).q(...). 

2. Once the q macro gets expanded, it reconstructs the code enclosed in the quotes and does some domain-specific parsing of the contents (via a customised parser). In particular it treats `$foo` and `..` as special domain-specific entities. This is implemented by constructing a textual representation of the quasiquote with each $foo replaced by a fresh name like $random$name$1. For example, q"$foo + $bar" is parsed in a separate parser instance as "$random$name$1 + $random$name$2". After the parsing through the customised parser is complete, the quasiquote implementation walks the parsed tree and replaces all the random names with the corresponding input arguments during reification of the parsed tree into the result tree.

The problem with this approach is that it relies on very fragile textual code reconstruction that might confuse malicious user input with unquoted arguments. To improve this situation in scala.meta, our current plan is to:

1. Have a finer and more delimited interface between parser and scanner that would, on a very high level, look like:

type Tok
def parse[T <: Tree](toks: Seq[Tok]): T = ...
def lex(input: Seq[Byte]): Seq[Tok] = ...

Where T is a "context" you would like to parse in. So for example parse[Term] would parse tokens as if they were used in token position.

Tokens are lightweight objects that are defined by their token type and precise range position. 

2. Introduce special kinds of tokens for ".." and $foo that allow embedding the quasiquote parser into the primary language parser without needing to split them apart. This would also remove the need for the textual reconstruction hack we have right now, as quasiquotes will be able to pass a customised sequence of tokens for the parsing of their snippets.
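As a hedged sketch, such tokens might look like this (illustrative names only, not scala.meta's actual API):

sealed trait Tok { def start: Int; def end: Int }
final case class Ident(name: String, start: Int, end: Int) extends Tok
final case class Keyword(name: String, start: Int, end: Int) extends Tok
// quasiquote-specific tokens: a hole for $foo, a splice marker for ..
final case class Hole(argIndex: Int, start: Int, end: Int) extends Tok
final case class Splice(start: Int, end: Int) extends Tok

// a quasiquote can then lex each StringContext part separately and join the
// streams with Hole tokens in between -- no textual reconstruction needed
def join(partToks: Seq[Seq[Tok]]): Seq[Tok] =
  partToks.zipWithIndex.flatMap { case (toks, i) =>
    if (i < partToks.length - 1) toks :+ Hole(i, -1, -1)  // -1: positions synthesized later
    else toks
  }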

The problem with the parboiled parser approach is that it won't let us do that, as parsing is done directly on character input, which requires the code reconstruction hack for the quasiquote implementation. A parboiled-based parser also wouldn't let us expose tokens in any way, which might be useful as an underlying primitive for code refactoring and rewriting tools that need to reason about code one level lower than the information available in the trees themselves. 

Cheers,
Denys

Denys Shabalin

Dec 1, 2014, 11:28:03 AM
to scala-i...@googlegroups.com
So for example parse[Term] would parse tokens as if they were used in token position.

*term position

Adriaan Moors

Dec 1, 2014, 2:35:03 PM
to scala-i...@googlegroups.com
Can we use two parboiled parsers? One turns characters into tokens, and the other takes tokens to an AST?

Mathias Doenitz

Dec 1, 2014, 2:40:55 PM
to scala-i...@googlegroups.com
Why do we need dedicated tokens again?
A fine-grained AST covering *all* the input (incl. whitespace, comments, etc.) and including all positional info should be fine.

The goal should be to allow for a full round-trip of any legal Scala compilation unit source to the AST and back.
If we have this, what do we need an intermediate token level for?

Cheers,
Mathias

---
mat...@parboiled.org
http://www.parboiled.org

Eugene Burmako

Dec 1, 2014, 3:18:52 PM
to <scala-internals@googlegroups.com>
Sometimes it is also desirable to look into the syntactic trivia that constitutes an AST. For instance, `class C` and `class C {}` might be represented by structurally identical ASTs, but maybe someone will actually want to discern those. Tokens would provide a high-level way of achieving that.

Haoyi Li

Dec 1, 2014, 3:32:31 PM
to scala-internals

Couldn't we just not throw away the information that makes them different? There's no reason they should be identical ASTs; looking at the token stream seems like a silly workaround. For example, the body of the class could be an Option.
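A minimal sketch of that suggestion (all names illustrative):

sealed trait Stat
final case class TemplateBody(stats: Seq[Stat])
final case class ClassDef(name: String, body: Option[TemplateBody])

// class C    ==> ClassDef("C", None)
// class C {} ==> ClassDef("C", Some(TemplateBody(Nil)))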

Mathias Doenitz

Dec 1, 2014, 3:39:44 PM
to scala-i...@googlegroups.com
Yes, that’s why I’d want to go for an AST that spans the complete input and enables a full round-trip.
That would entail keeping all the differences between `class C` and `class C {}`.

Matthew Pocock

Dec 1, 2014, 3:46:41 PM
to scala-i...@googlegroups.com
I very much agree with the sentiment of having a genuinely 'complete' AST that doesn't elide syntactic differences that have identical semantics (including retaining whitespace, comments and elidable brackets/braces/boxes). It's an AST of the source code, not of some abstract interpretation of it. I would hope that, with future whole-module optimization, we would recover most of the performance enhancements you get from a simplified AST by using the full AST plus semantic mapping functions.

Matthew

Eugene Burmako

Dec 1, 2014, 3:48:38 PM
to <scala-internals@googlegroups.com>
1) We tried something like that, but then the convenience of the happy path (when the user doesn't care about whether the empty body actually has or doesn't have braces) is compromised. 
2) We also considered having different AST nodes for empty and non-empty templates (actually, the situation was different, but the idea was the same). This also makes analyzing such trees less convenient for users who don't need that level of detail.
3) Yet another approach that we tried is having boolean flags on certain trees (e.g. Template.hasBraces), but this requires manual handling of every such special case and ended up being hard to abstract over.

Then Denys came up with the idea of tracking information about trivia in tokens, and that seemed to solve every problem that we had: A) positions, B) prettyprinting, C) a uniform interface for analyzing syntactic quirks. What is appealing in this approach is that there is no need for a separate mechanism to address A, B and C - they all automatically follow from AST nodes having tokens. What do you think?

Haoyi Li

Dec 1, 2014, 3:54:44 PM
to scala-internals

@denys with the range positions and the source code, you can do all the postprocessing you want to find the individual tokens in an AST node. The parser could provide them, but it doesn't seem necessary if someone else can do it in a separate phase. 

We could also just collect this information as part of the initial AST if so desired. Remember, this parser is actually sane! This parser isn't a 7000LOC beast full of mutable state which you don't dare touch because you're scared of breaking things. We can modify it in a few hours! If we're going to be using it for something, we should stop thinking about "working around and reconstructing" and start thinking about "making it parse what we want" in the first place. 

I don't fully understand your other request - parsing ".." and "$foo" - but here are my two interpretations:

- You want to hardcode support for quasiquotes q"..." and q"""...""" directly into the main scala-parser, so you can parse the whole thing in one go and not have to do two-phase parsing. Easy! But you shouldn't do it! Macros do macro things and the parser does parser things, and never shall they meet.

- You want to have a fork of the parser with additional syntactic rules that will be used to parse the stuff inside q"..." and q"""...""" as a second phase. Also easy! This is something I can get behind.

@xeno_by I think this request points at a fundamental problem with your data model/abstraction. You don't know whether it's important or not, and thus you're exposing this weird side-channel by which the user can go behind the back of the parser and funge with things they really shouldn't have to care about. I don't really see why making the thing an Option or putting in a flag compromises anything. 

Either it's important and people should care, or it's not important and people shouldn't. E.g. if the Boolean is important and we put it in the AST, constructing the thing is unchanged (hooray default arguments) and pattern matching just requires an additional `, _` (maybe even unchanged, because we can overload extractors, I think?), so it's not the end of the world.
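A quick sketch of that point (illustrative names, script-style Scala):

sealed trait Stat
case class ClassDef(name: String, body: Seq[Stat], hasBraces: Boolean = false)

val c1 = ClassDef("C", Nil)                    // old call sites still compile
val c2 = ClassDef("C", Nil, hasBraces = true)  // new info when you want it

// matching needs one extra wildcard for callers that don't care:
c2 match { case ClassDef(name, _, _) => println(name) }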

I can also imagine having separate parsers for separate use cases spitting out ASTs with different levels of detail. At 700LOC, it is entirely feasible to maintain two or three in parallel, because it's 700LOC of very-not-confusing logic.

Eugene Burmako

Dec 1, 2014, 4:02:52 PM
to <scala-internals@googlegroups.com>
With the idea of tokens that underlie all AST nodes, no information is lost anyway - it's just hidden behind a lexical layer of the API. 

So if you want just an AST, as in abstract syntax trees, you do `import scala.meta._` (or `import scala.meta.syntactic._`). If you want full lexical information, you do `import scala.meta.lexical._`, and that gives you something like `Tree.tokens`. Finally, if you want full semantic information, you do `import scala.meta.semantic._`, which enriches your vocabulary with things like `Ref.resolve`, `Term.tpe` or `Class.parents`. This, I think, nicely manages the complexity of the API.

There is definitely tension between concreteness and abstractness of syntax trees. Increasing concreteness makes it easier to state precise facts about the structure of the programs, whereas increasing abstractness makes it easier to operate on trees. Finding the sweet spot is not easy, and I think it's still an open question, though having more technology to balance things (e.g. tokens and quasiquotes) is definitely welcome.

Btw we also considered having both concrete and abstract syntax trees, where the former would be unattributed (pre-typer) and the latter would be attributed (post-typer). This, however, brings a whole new bunch of problems. How does one mix the two, e.g. when we need to insert macro arguments (post-typer) into a quasiquote (pre-typer)? How does one treat both concrete and abstract trees uniformly, e.g. in traversals or helper functions? After thinking about these problems for a while, we decided to stick with the original idea of having a single representation for both pre-typer and post-typer trees.

Mathias Doenitz

Dec 1, 2014, 4:43:47 PM
to scala-i...@googlegroups.com
A few more thoughts:

As a PEG parser, pb2 operates on the character level without a preprocessing lexer phase. Apart from the simplification of having to write only one thing (the parser) rather than two (lexer + parser), this has other benefits.
For example, when we re-implement parse error recovery (which has not yet been ported from parboiled 1.x: https://github.com/sirthias/parboiled2/issues/42) we can do things on the character level that a lexer+parser can’t do, like recovering from typos in keywords.
Single-char errors (like `objet`, `obbject` or `obiect` instead of `object`) can be properly recognised and “fixed”.

> With the idea of tokens that underlie all AST nodes, no information is lost anyway - it's just hidden behind a lexical layer of the API.

Yeah, but why introduce a lexical layer if one isn’t needed?
AFAIU tokens only serve two purposes:
1. increase look-ahead reach (LL(k) can’t look ahead beyond arbitrarily long identifiers without a tokenizer)
2. “compress” the input for memory/performance reasons

(1) isn’t an issue in PEGs and if (2) isn’t required because the parser is “fast enough” then a lexer doesn’t add any real value.
Everything a lexer does we can also do in the parser itself. I don’t see any exceptions.

Rex Kerr

Dec 1, 2014, 5:17:01 PM
to scala-i...@googlegroups.com
In order to preserve code formatting etc., it's useful to have generators that function at a fairly high level and encapsulate most common formats.  I can't quite tell whether Tokens are this.  It's certainly not the name I'd have chosen, but Denys's description sounds similar.

Anyway, when building an AST, I think these generators are what should be built.  Maybe they need the exact literal source that part of the tree was from, or maybe not; easily-manipulable trees aren't always in a source-code-like order.

But it sounds totally fine to me to treat this as a completely separate step, since for many applications you just want the AST and don't care how you got it.  So if you happen to start building an AST, hopefully you can just avoid choosing a design that would make this very difficult.  (Whether this is what Eugene is talking about I can't quite tell either.)

Actually, a really nice AST design would be one that makes it easy to layer on all sorts of additional logic about the tree and properties thereof.  Working with trees always seems such a slog, in large part because the information one wants is buried, can't be found and stored, goes out of date without being automatically fixed, etc.  Not sure if scala.rx has the performance characteristics needed for that.

Anyway, as a second step you want to be able to save ASTs and bind source code as a diff against an AST and then see if you can't remodel the AST rather than recomputing all the difficult-to-compute parts of it from scratch.  If it is possible to do this, then recompile times go to ~zero (assuming parse-plus-bind is fast, which it should be as it's mostly text manipulation).

To me, this is the biggest draw of having a simple parser: the likelihood that you can get a sufficiently rich intermediate representation to save a ton of hard work because you recognize trivial differences in the simple stuff.  It's not easy, though, because to be fast trees need to know why certain decisions were made so they don't have to recompute them unless the "whys" change.

  --Rex

Rex Kerr

Dec 1, 2014, 5:18:57 PM
to scala-i...@googlegroups.com
Er, in case it's not clear: the reason you want a simple (and fast!) parser is because you (probably) always have to parse. I guess you could try doing dumb text matching, but I've tried that and it seemed fragile. Once you've parsed, though, you might be able to skip a huge amount of work.

  --Rex

Adriaan Moors

Dec 1, 2014, 5:24:25 PM
to scala-i...@googlegroups.com
Thanks, Haoyi and Mathias for explaining! 

Your approach makes a lot of sense to me, and I think it would be great to start by incorporating this into scala.meta to gain some real-life experience with it before we decide to integrate it into scalac (I'm a little hesitant to have the parser rely on macros due to bootstrapping requirements and the experimental state of macros). 

How efficient can the abstract representation of the syntax be made given the requirement of round tripping to the same concrete syntax?

Mathias Doenitz

Dec 1, 2014, 5:59:15 PM
to scala-i...@googlegroups.com
> I think it would be great to start by incorporating this into scala.meta to gain some real-life experience with it before we decide to integrate it into scalac (I'm a little hesitant to have the parser rely on macros due to bootstrapping requirements and the experimental state of macros).

Yeah, even though Haoyi’s work looks promising, I think it is way too early to think about moving scalac itself away from a hand-rolled parser.
Three things would first have to be proven outside of scalac:

1. Parsing performance really is good enough.
Even though pb2 produces parser code that performs well, a good hand-crafted parser will always be faster.
The question of how close we can get by further optimising the generator as well as the grammar needs to be answered.

2. Error reporting is good enough.
This is an area that likely needs a bit more work, in pb2 as well as the grammar.
But my gut feeling is that it’s possible to get there.

3. We have “the right” AST design.
This has little to do with pb2 as it poses almost no restrictions in this regard.
We can design the AST in any way we like and keep/drop any info we like from the underlying source.
I guess the challenge here is to build upon the experience with the current AST without (unintentionally) confining oneself too much to existing "legacy structures” that serve no deeper purpose (like, possibly, a separate token level).


> How efficient can the abstract representation of the syntax be made given the requirement of round tripping to the same concrete syntax?

That is a good question and depends a lot on how exactly the AST is designed and what our priorities are here.
It might be that certain factors influencing AST design have changed a bit since the foundations of the current scalac design were laid.
For example, given today's hardware and typical memory resources, one might strike the typical memory vs. performance or mutability vs. immutability balance differently than back then.

Martin Mauch

Dec 1, 2014, 6:34:40 PM
to scala-i...@googlegroups.com
Hi Rex,

actually, there is a Scala parser that is also based on PEGs and does incremental parsing:
https://github.com/Eliah-Lakhin/papa-carlo
Eliah and I even tried to create a Scala parser with Papa Carlo (or rather a converter that takes an ANTLR grammar and converts it to Papa Carlo). Unfortunately we didn't get that far, because of time constraints and because neither of us knew the ANTLR framework very well:
https://groups.google.com/forum/m/#!topic/papa-carlo/NWo1p6jDKtE
With a working and understandable PEG grammar for Scala it should be very doable to create an incremental Scala parser. IMO it would be great to unify the incremental approach of Papa Carlo with the mature and efficient pb2, even better if an FRP library could be leveraged to reactively update type information and maybe even compiled bytecode.

If someone wants to do such an undertaking I can try to free some time and join the effort :D

@Mathias: Did you already have a look at Papa Carlo? Could something like this be built upon pb2?



Haoyi Li

Dec 1, 2014, 6:36:03 PM
to scala-internals
Yeah, even though Haoyi’s work looks promising, I think it is way too early to think about moving scalac itself away from a hand-rolled parser.

Agree 100%. Mathias is also right that all the talk about ASTs is kind of irrelevant: the parboiled parser can spit out whatever AST you want, given about one day's worth of effort. Also, we can modify the parser to accommodate whatever special syntax we want: special-casing for q"", special-casing for ..$ inside the grammar, etc. are all ~2hrs worth of effort.

On the other hand, you guys should definitely try it out. Given the amount of effort that has been put into reverse-engineering the current parser to undo the various desugarings it does, a tiny fraction of that will answer everyone's questions about "can this parser do X" =P 

In terms of perf, it can parse *everything* in scala/scala, lift, scalaz, scalajs, playframework and shapeless in 15 seconds. Naturally this time will go up as we try to do fancy things like generating ASTs with range positions, but given that any AST we'd want to generate will take about the same amount of time to instantiate all those objects, I don't think performance in general will be a problem.

Error reporting needs work. I'm pretty sure we can make it awesome and almost automated with some amount of work, but we're not there yet.

Martin Mauch

Dec 1, 2014, 6:41:15 PM
to scala-i...@googlegroups.com
Btw: As a developer I would very much prefer a compiler that is slower on the initial parse but does speedy incremental updates. So the last x% of performance that a hand-written parser can give you might not even be an issue if we can get reactive compilation in exchange.

Haoyi Li

Dec 1, 2014, 6:52:54 PM
to scala-internals
Since everyone is talking about performance, does anyone know how much of the time in the compiler and macros is spent parsing? My impression is that the vast vast vast majority of the time is spent in the typechecker, so much so that even if parsing took 2x as long it'd be no big deal anyway.

On Mon, Dec 1, 2014 at 3:41 PM, Martin Mauch <martin...@gmail.com> wrote:
Btw: As a developer I would very much prefer a compiler  that is slower on initial parse but does speedy  incremental updates. So the last x% performance that a hand written parser can give you might not even be an issue if we can get reactive compilation in exchange.

martin odersky

Dec 1, 2014, 7:02:01 PM
to scala-internals
On Tue, Dec 2, 2014 at 12:52 AM, Haoyi Li <haoy...@gmail.com> wrote:
Since everyone is talking about performance, does anyone know how much of the time in the compiler and macros is spent parsing? My impression is that the vast vast vast majority of the time is spent in the typechecker, so much so that even if parsing took 2x as long it'd be no big deal anyway.

Yes, parsing is pretty insignificant compared to the other tasks of a compiler. But it's more important in IDEs because (1) we re-parse on every keystroke, and (2) sometimes a whole project or set of projects has to be parsed for initial indexing (e.g. to figure out where the toplevel classes are).  

That said, the current dotty parser (hand-written, 2100 lines including error reporting, accurate positions and tree construction) achieves several hundred thousand lines a second. So parboiled still has some way to go to beat that :-)

Cheers

 - Martin



On Mon, Dec 1, 2014 at 3:41 PM, Martin Mauch <martin...@gmail.com> wrote:
Btw: As a developer I would very much prefer a compiler  that is slower on initial parse but does speedy  incremental updates. So the last x% performance that a hand written parser can give you might not even be an issue if we can get reactive compilation in exchange.




--
Martin Odersky
EPFL

Jon Pretty

Dec 1, 2014, 7:57:57 PM
to scala-internals
And (in theory) many source files can be parsed in parallel, which isn't true of later phases.

Cheers,
Jon
Jon Pretty | @propensive

Haoyi Li

Dec 1, 2014, 8:25:41 PM
to scala-internals
That said, the current dotty parser (hand-written, 2100 lines including error reporting, accurate positions and tree construction) achieves several hundred thousand lines a second. So parboiled still has some way to go to beat that :-)

Can you run the dotty parser against the projects I used to benchmark scala-parser? I'm only reaching 50kLOC a second, so I'm wondering what you're doing that's 10x faster, or whether the subset of things you're parsing is just easier to parse =P

FWIW, the Scala compiler runs at about 0.5-0.05kLOC a second, so the parsing portion of the compiler is 1%-0.1% of the compilation pipeline.


Guillaume Massé

Dec 1, 2014, 11:40:38 PM
to scala-i...@googlegroups.com
This project is an incremental parser, also using a PEG grammar: http://lakhin.com/projects/papa-carlo

Mathias Doenitz

Dec 2, 2014, 3:18:30 AM
to scala-i...@googlegroups.com
Guillaume,

as I’ve commented in this ticket: https://github.com/sirthias/parboiled2/issues/114
papa carlo appears to be about 1000 times slower than pb2, which defeats the benefit of the incremental approach in all real-world scenarios,
even if papa carlo exhibits constant vs. linear runtime in an IDE setting.

martin odersky

Dec 2, 2014, 3:30:30 AM
to scala-internals
On Tue, Dec 2, 2014 at 2:25 AM, Haoyi Li <haoy...@gmail.com> wrote:
That said, the current dotty parser (hand-written, 2100 lines including error reporting, accurate positions and tree construction) achieves several hundred thousand lines a second. So parboiled still has some way to go to beat that :-) 

Can you run the dotty parser against the projects I used to benchmark scala-parser? I'm only reaching 50kLOC a second, so I'm wondering what you're doing that's 10x faster, or whether the subset of things you're parsing is just easier to parse =P

It parses the Dotty variant of Scala. There are a few things it cannot handle. XML literals is one (we will use string interpolation for them). Early definitions is another (we will have trait parameters instead). But by and large, it's the same language. I doubt the differences matter in terms of parse speed.

Some more data: The parser itself is about 1500 non-comment non-blank code lines. The tokenizer comes in at another 750 lines. Without tree construction, I have measured it at about 500KLoc/sec. With tree construction, it's less but still more than 100KLoc/sec.

The parser is very maintainable. What I always do is this: 

 - Start with an accurate grammar
 - Massage so that the grammar is mostly LL(1)
 - Translate the grammar to code by slavishly sticking to the following schema: 

   - {...} becomes while
   - [...] becomes if 
   - ... | ... becomes if-then-else

In front of each parse method, stick the grammar it parses as a doc comment. Make sure the snippets together correspond one-to-one to the official published grammar. 
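A self-contained toy illustrating that schema (illustrative only, not dotty code), for the made-up rule Digits ::= digit {digit} [`.' digit {digit}]:

class ToyParser(input: String) {
  private var pos = 0
  private def ch: Char = if (pos < input.length) input(pos) else '\u0000'

  /** Digits ::= digit {digit} [`.' digit {digit}] */
  def digits(): String = {
    val start = pos
    require(ch.isDigit, s"expected digit at offset $pos")
    while (ch.isDigit) pos += 1          // {digit}   becomes while
    if (ch == '.') {                     // [`.' ...] becomes if
      pos += 1
      while (ch.isDigit) pos += 1
    }
    input.substring(start, pos)
  }
}
// new ToyParser("3.14").digits() == "3.14"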

Constructing a parser for full Scala takes about two days using this method. The hard part is not the parsing, but deciding what trees to generate and maintaining accurate range positions. What's important for maintainability is that this simple scheme is not disturbed by "optimizations". Over the years the scalac parser underwent such optimizations to reduce code duplication, make it more functional, etc. That's no doubt well-meant but it destroys the clear link between grammar and code. And once you destroy that link, the parser becomes the de-facto grammar, which leads to the mess we are in with the scalac parser.

Cheers

 - Martin

/Users/odersky/workspace/dotty/src/dotty/tools/dotc/parsing> cloc Parsers.scala
       1 text file.
       1 unique file.                              
       0 files ignored.

http://cloc.sourceforge.net v 1.56  T=0.5 s (2.0 files/s, 4222.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Scala                            1            220            396           1495
-------------------------------------------------------------------------------



--
Martin Odersky
EPFL

martin odersky

Dec 2, 2014, 3:35:14 AM
to scala-internals
--
Martin Odersky
EPFL

martin odersky

Dec 2, 2014, 3:37:23 AM
to scala-internals
On Tue, Dec 2, 2014 at 9:34 AM, martin odersky <martin....@epfl.ch> wrote:

And, looking at it again, I see that the current parser still handles XML. But that will be removed later, once the rewrite tool can translate XML to interpolated strings.

 - Martin



--
Martin Odersky
EPFL

martin odersky

Dec 2, 2014, 3:51:00 AM
to scala-internals
On Tue, Dec 2, 2014 at 2:25 AM, Haoyi Li <haoy...@gmail.com> wrote:
That said, the current dotty parser (hand-written, 2100 lines including error reporting, accurate positions and tree construction) achieves several hundred thousand lines a second. So parboiled still has some way to go to beat that :-)

Can you run the dotty parser against the projects I used to benchmark scala-parser? I'm only reaching 50kLOC a second, so I'm wondering what you're doing that's 10x faster, or whether the subset of things you're parsing is just easier to parse =P

FWIW, the Scala compiler runs at about 0.5-0.05kLOC a second, so the parsing portion of the compiler is 1%-0.1% of the compilation pipeline.

Yes, and in fact it's quite a mystery where some of the rest of the time is spent. It's almost like searching for dark matter :-) 

And also, just to clarify: I think your parser is fantastic work, and it would be great to have it as a reference parser in the standard library. For the compiler itself, I still prefer hand-written, but there are many other tools that could profit from a standard parser that is decoupled from a full compiler and yours seems to fit the bill nicely! Having two parsers is not a problem as long as we make sure they implement the same grammar.



--
Martin Odersky
EPFL

Som Snytt

Dec 2, 2014, 5:28:35 AM
to scala-internals
> make sure they implement the same grammar.

I thought it would be useful to reduce the grammar duplication in the spec (it appears in both the text and the summary), but really scaladoc should just gather the grammar from the parser doc comments.

Automated test generation would be nice, too.  I felt it was over-refined to add a test that 0XF means something and 0x does not (for the pb2 and scala2 versions, respectively).

I wonder if multiple parsers can help with error reporting.  I was just looking at how the brace-healing pass affects my errors.  One could imagine options to reparse with different parsers or error recovery strategies. -Xerror-help=11.

Denys Shabalin

Dec 2, 2014, 6:08:53 AM
to scala-i...@googlegroups.com
@HaoyiLi

I don't want to interpret quasiquote contents at parse time. What I strive for is to simplify the implementation that happens at macro expansion time. In more concrete terms, let's analyse what all the code in the current quasiquote implementation does.

In particular we are interested in quasiquote macro code:


and companion runtime helpers:


When quasiquotes are used, the following code executes:

1a. (compile-time) Textually reconstruct the source code of the snippet within the quasiquote by joining the string interpolation parts the macro gets as input, with fresh names like $fresh$name$1 in between (≈30 LOC)
1b. (compile-time) Parse it with a custom subclass of the parser that treats $fresh$name$1 identifiers as a special kind of token (≈300 LOC)
2. (compile-time) Traverse the parse tree and turn it into code that generates these trees using the runtime companions, at compile time, to be able to distinguish higher-level language constructs that have already been desugared after parsing (≈900 LOC)
3. (run-time) Execute a mix of companion helpers and tree constructors at runtime (≈1000 LOC)

I believe that the first part of the puzzle (1b) can be embedded in the parser, given that special tokens for ".." and $foo exist. This embedding won't overlap with actual Scala parsing, as the changes are quite minimal (e.g. when you see $foo, produce a Hole(...) node; when you see .., produce a Splicing(...) node) and would add just a few production rules to the grammar. Quasiquotes also won't have to textually reconstruct (1a) snippets of code; they can pre-scan each string interpolation part and pass a finely tuned joined sequence of tokens to the parser (with special hole tokens in between the tokens produced by the parts). 

The second part (reification of trees) should be automatically generated through an implicit macro, with minimal (≤200 LOC) manual intervention to support holes and splicing. The amount of code and effort needed to get this right with the current ASTs is mind-boggling.

The third part is completely unnecessary if the trees are close to the syntax production rules; more specifically, if no desugaring happens during parsing. And that has been one of our priorities in the scala.meta AST design. 

Overall, the current low-level trees led to a massive and overly complicated implementation that I'm personally not proud of. I believe that a tightly integrated connection between parsing and trees should decrease the amount of code needed to implement quasiquotes by a ≈10x factor, and I'm afraid that such a reduction will not be possible if we stick to the current approach. Unfortunately, I don't see how you will be able to use the parboiled2 parser to do anything but the old and ugly "construct a code snippet textually and then parse it as if it were a separate compilation unit" trick.

Haoyi Li

Dec 2, 2014, 6:26:57 AM
to scala-internals
> 1a. (compile-time) Textually reconstruct the source code of the snippet within the quasiquote by joining the string interpolation parts the macro gets as input, with fresh names like $fresh$name$1 in between (≈30 LOC)
> 1b. (compile-time) Parse it with a custom subclass of the parser that treats $fresh$name$1 identifiers as a special kind of token (≈300 LOC)

These two steps can be combined into a single step, which you can make output whatever magic tree you want. ~10LOC

> 2. (compile-time) Traverse the parse tree and turn it into code that generates these trees using the runtime companions, at compile time, to be able to distinguish higher-level language constructs that have already been desugared after parsing (≈900 LOC)

This will completely disappear. We can parse right to the high level constructs immediately.

> 3. (run-time) Execute a mix of companion helpers and tree constructors at runtime (≈1000 LOC)

I have no idea what this is but since you say it's unnecessary if the AST matches the rules, this will go away too.
 
> I believe that the first part of the puzzle (1b) can be embedded in the parser, given that special tokens for ".." and $foo exist. This embedding won't overlap with actual Scala parsing, as the changes are quite minimal (e.g. when you see $foo, produce a Hole(...) node; when you see .., produce a Splicing(...) node) and would add just a few production rules to the grammar. Quasiquotes also won't have to textually reconstruct (1a) snippets of code; they can pre-scan each string interpolation part and pass a finely tuned joined sequence of tokens to the parser (with special hole tokens in between the tokens produced by the parts). 

You have basically described exactly what would happen with the parboiled parser, to the letter. I have not made it generate trees, but this is what it would do. $foo and ${foo.bar} are naturally completely independent of quasiquotes: any interpolated string will have these holes. Making the parser clever enough to fill in these holes "directly", rather than leaving stub nodes and walking the tree later, is trivial.

> Overall, the current low-level trees led to a massive and overly complicated implementation that I'm personally not proud of. I believe that a tightly integrated connection between parsing and trees should decrease the amount of code needed to implement quasiquotes by a ≈10x factor, and I'm afraid that such a reduction will not be possible if we stick to the current approach. 

Agreed

> Unfortunately, I don't see how you will be able to use the parboiled2 parser to do anything but the old and ugly "construct a code snippet textually and then parse it as if it were a separate compilation unit" trick.

Disagree =P Everything you said up to this point was correct, and what you say you want matches exactly what the parboiled parser would provide. 

As you said, we have no specifics, because the parboiled parser does not spit out ASTs right now. You'll need to try out the parboiled parser yourself and see what ASTs you can generate with it. I promise it won't be too hard: I added XML support to the parser (from nothing) in about 5 hours. The things you are asking for are vague and hand-wavy, but ultimately seem pretty straightforward. 

Johannes Rudolph

Dec 2, 2014, 7:01:23 AM
to scala-i...@googlegroups.com
Hi Denys,

as you said, the parser needed for parsing quasiquote strings is different from a regular Scala parser, because you need to parse other kinds of placeholders directly in Scala syntax (like `..` etc.) and your end product should be another (but very similar) kind of AST that allows holes inside regular Scala expressions.

The good thing is that parboiled parsers are quite composable, so if you structure it accordingly you will be able to reuse most of it. The simplest way of reusing would be to include holes directly in a unified AST and just add a flag that enables or disables hole parsing. A cleaner solution would be to have two different kinds of ASTs for Scala code with or without holes (with all the duplication that might be necessary).

The basic question is how to parse the holes, given that parboiled2 works on character-based input. I guess the simplest solution would be to number the holes by occurrence in the string and then e.g. replace each one with a character from the Unicode "private use area" [1].
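A rough sketch of that idea (the numbering scheme here is made up for illustration):

// map the i-th hole to one private-use-area character, so the parser
// still sees ordinary characters, exactly one per hole
val PuaBase = '\uE000'

def encodeHoles(parts: Seq[String]): String =
  parts.init.zipWithIndex
    .map { case (part, i) => part + (PuaBase + i).toChar }
    .mkString + parts.last

def holeIndex(c: Char): Option[Int] =
  if (c >= '\uE000' && c <= '\uF8FF') Some(c - PuaBase) else None

// encodeHoles(Seq("", " + ", "")) == "\uE000 + \uE001"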

I guess that's basically what you proposed, right?

Johannes

[1] http://en.wikipedia.org/wiki/Private_Use_Areas#Private_Use_Areas

-----------------------------------------------
Johannes Rudolph
http://virtual-void.net

Rodrigo Cano

Dec 2, 2014, 7:31:06 AM
to scala-i...@googlegroups.com
@martin odersky and @HaoyiLi: regarding LOC/sec, a common machine should produce the metrics - if we really care about this - but aside from that, I'd like to say that if you manage to have a parser *fast enough* that it doesn't take any meaningful time in the compile pipeline, then I would certainly opt for one that is clear and uses a standard PEG tool like parboiled2 over a hand-made one, since that makes it very easy to understand and maintain (again, provided that it is already fast enough), even merely for the fact of it being more standard (though it could be argued that pb2 hasn't reached that level of standardness for Scala).

Denys Shabalin

Dec 2, 2014, 9:10:09 AM
to scala-i...@googlegroups.com
@Johannes

I think you've got what I was trying to explain. Indeed, storing information about holes in ASTs is the desired outcome, but the question is how we get there.

Using unused Unicode characters for holes is an interesting workaround. It still has one drawback of textual code reconstruction that I've not mentioned before: if you parse in a new parser using freshly generated textual input, your positions will not be reported with respect to the original source code, but in terms of the generated code. To fix this, one needs to post-process the position data and re-map it between the two, and it's not as simple as shifting by an offset, because holes can have different lengths in the generated and the original source code. 

This won't be a problem with a parser that works on tokens, because the parser isn't responsible for position handling; only the scanner is. It's trivial to pass in tokens with nice positions, and no post-processing step is required in that case.


Johannes Rudolph

unread,
Dec 2, 2014, 9:22:49 AM12/2/14
to scala-i...@googlegroups.com
On Tue, Dec 2, 2014 at 3:10 PM, Denys Shabalin <denys.s...@epfl.ch> wrote:
> This won't be a problem with parser that works on tokens because parser
> isn't responsible for position handling, only scanner is. It's trivial to
> pass tokens with nice positions and no post processing step is required in
> that case.

Yes, good point. You would need a mapping when using character-based
parboiled2, either by finding an encoding that keeps the original
character length or by providing a somewhat simple translation table
which does the mapping efficiently.
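
For instance, something along these lines (names invented, just to sketch the translation-table idea):

// Each entry records the cumulative offset difference (original minus generated)
// that applies from a given offset in the generated text onwards.
case class Shift(fromGenOffset: Int, delta: Int)

// shifts must be sorted by fromGenOffset; one entry is added per hole
def toOriginal(genOffset: Int, shifts: Seq[Shift]): Int =
  genOffset + shifts.takeWhile(_.fromGenOffset <= genOffset)
                    .lastOption.map(_.delta).getOrElse(0)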

Haoyi Li

unread,
Dec 2, 2014, 1:41:07 PM12/2/14
to scala-internals
> Yes, good point. You would need a mapping when using character-based
> parboiled2, either by finding an encoding that keeps the original
> character length or by providing a somewhat simple translation table
> which does the mapping efficiently.

Can't we avoid needing this weird substitution thing entirely if we store the position of each hole? I'm imagining an AST like
case class StringLiteral(str: String, holes: Seq[Hole], start: Int)
case class Hole(tree: Tree, start: Int, end: Int) // Tree is whatever AST the original parser produced

class QuasiquoteParser(val input: ParserInput, holes: Seq[Hole]) extends Parser {
  // index the holes by their start offset in the source
  val holeMap = holes.map(h => h.start -> h).toMap
  def MaybeHole: Rule1[Hole] = holeMap.get(cursor) match {
    // no hole starts at the current position: fail this alternative
    case None    => rule( MISMATCH )
    // a hole starts here: consume exactly its characters and push its AST
    case Some(h) => rule( (h.end - h.start).times(ANY) ~ push(h) )
  }

  ... use MaybeHole elsewhere in the grammar anywhere a hole could possibly appear ...
}
That way we'll keep the original Hole ASTs around from the original parse, with the original, correct range positions. The non-Hole parts of the quasiquote will need to have their range positions fixed, but that'll simply be a matter of adding the start position of the quasiquote, since all the internal relative positions will be unchanged.

Tony Sloane

unread,
Dec 3, 2014, 4:35:45 PM12/3/14
to scala-i...@googlegroups.com
Hi Martin,

You say that your new dotty parser is very maintainable and I’m interested in exploring that comment.

How do you deal with the interaction between rules? Essentially information such as FIRST sets must be shared, which means that it’s not quite as simple as writing the grammar rules down and then coding them independently. A change to a grammar rule in one place can necessitate a change to a parser somewhere else.

As a concrete example, your grammar rule parser for simple types is:

    /** SimpleType       ::=  SimpleType TypeArgs
     *                     |  SimpleType `#' Id
     *                     |  StableId
     *                     |  Path `.' type
     *                     |  `(' ArgTypes `)'
     *                     |  Refinement
     */
    def simpleType(): Tree = simpleTypeRest {
      if (in.token == LPAREN)
        atPos(in.offset) { makeTupleOrParens(inParens(argTypes())) }
      else if (in.token == LBRACE)
        atPos(in.offset) { RefinedTypeTree(EmptyTree, refinement()) }
      else path(thisOK = false, handleSingletonType) match {
        case r @ SingletonTypeTree(_) => r
        case r => convertToTypeId(r)
      }
    }

First of all, the cases are not obviously derived directly from the relevant RHSes, so the “correctness” of the code is harder to see than if the correspondence was direct.

But putting that aside, there’s a non-trivial step from the grammar rule to the parsing code, mostly due to the need to identify the FIRST sets of the grammar rule RHSes. If you were to change the grammar of Refinement, for example, in a way that affected its FIRST set, you would have to remember to come back to the SimpleType parser and update the second test. Do you not find this to be an issue in practice?

Perhaps it is mostly a function of how often your grammar changes. I would imagine if you are making frequent changes to the syntax, keeping all of this up-to-date would be quite error prone. An advantage of generator-based approaches is that the tool takes care of that for you.

cheers,
Tony

martin odersky

unread,
Dec 3, 2014, 4:49:35 PM12/3/14
to scala-internals
On Wed, Dec 3, 2014 at 10:35 PM, Tony Sloane <inky...@gmail.com> wrote:
> You say that your new dotty parser is very maintainable and I’m interested in exploring that comment.
>
> How do you deal with the interaction between rules? Essentially information such as FIRST sets must be shared, which means that it’s not quite as simple as writing the grammar rules down and then coding them independently. A change to a grammar rule in one place can necessitate a change to a parser somewhere else.
>
> [snip: the SimpleType grammar rule and parser code quoted in full above]
>
> But putting that aside, there’s a non-trivial step from the grammar rule to the parsing code, mostly due to the need to identify the FIRST sets of the grammar rule RHSes. If you were to change the grammar of Refinement, for example, in a way that affected its FIRST set, you would have to remember to come back to the SimpleType parser and update the second test. Do you not find this to be an issue in practice?

No. You need to compute FIRST sets anyway to be sure that your grammar is LL(1). Or rather, in practice, it is never LL(1) but you need to know precisely why not, and what remedies you have. A backtracking parser won't help for a production quality compiler. You can't have unchecked backtracking because it might just be too slow. So in practice even with a backtracking parser you need to know about FIRST sets.
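
To see why FIRST sets are a whole-grammar property, here is a toy fixed-point computation (illustrative only, not code from dotty or scala-parser):

// Toy grammar representation, purely illustrative
sealed trait Sym
case class T(name: String) extends Sym // terminal
case class N(name: String) extends Sym // nonterminal

// Fixed-point computation of FIRST sets, ignoring epsilon productions for brevity
def firstSets(g: Map[N, List[List[Sym]]]): Map[N, Set[T]] = {
  var first = g.keys.map(_ -> Set.empty[T]).toMap
  var changed = true
  while (changed) {
    changed = false
    for ((nt, alts) <- g; alt <- alts if alt.nonEmpty) {
      val add = alt.head match {
        case t: T => Set(t)                           // FIRST(t alpha) = {t}
        case n: N => first.getOrElse(n, Set.empty[T]) // FIRST(N alpha) = FIRST(N)
      }
      if (!add.subsetOf(first(nt))) { first += nt -> (first(nt) ++ add); changed = true }
    }
  }
  first
}

Changing one nonterminal's productions can change the FIRST set of every nonterminal that references it, which is exactly the bookkeeping that a hand-written parser's dispatch tests encode.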
 
> Perhaps it is mostly a function of how often your grammar changes. I would imagine if you are making frequent changes to the syntax, keeping all of this up-to-date would be quite error prone. An advantage of generator-based approaches is that the tool takes care of that for you.

I would agree that for a fast-changing syntax, a parser generator is preferable. But Scala is way beyond that point. 

Cheers

 - Martin 



--
Martin Odersky
EPFL

Matthew Pocock

unread,
Dec 3, 2014, 5:37:11 PM12/3/14
to scala-i...@googlegroups.com
I think it comes down to this trade-off:

  • grammars are cleaner to read than code
  • code for generating code from grammars may or may not be easier to maintain than hand-rolled parsers for a grammar
  • code that produces optimized implementations of a grammar that run *fast enough* may be sufficiently gnarly to maintain that, for a one-off special-case grammar, hand-rolling it is the easier path

IMHO, if the parboiled2 output were optimized sufficiently to come close enough to the hand-rolled parser, if parboiled2 is likely to survive independently of being a dependency of the Scala core, and if the parboiled2 code, after it's been kicked in the shins to produce this sufficiently fast parser, is itself maintainable, then there's a case for using it in scalac (and/or its successor). Otherwise I can understand the pragmatic choice to hand-roll the parser in scalac.

M

Tony Sloane

unread,
Dec 3, 2014, 6:00:49 PM12/3/14
to scala-i...@googlegroups.com
Yeah, that sounds about right to me. The thing to remember is that there are (at least) three axes: similarity of the “code” to the grammar, the parsing method (and its properties), and hand-coded vs. generated implementation. E.g., it’s quite possible to have non-backtracking parsers that are generated from nice grammars, just as it is possible to have fast, hand-written parsers that have a non-obvious relationship to the grammar.

cheers,
Tony

Sam Halliday

unread,
Dec 10, 2014, 10:19:29 AM12/10/14
to scala-i...@googlegroups.com
Just wanted to say that this looks awesome and we'd love to use it in ENSIME if we can find a good use-case :-D

At the moment, the presentation compiler is giving us everything we need... plus we are retaining 2.9.x compatibility until the next stable.

But for the future... this plus type trees. Oh, yes please :-D
