Recursive descent parsers

24 views
Skip to first unread message

Toby Corkindale

unread,
May 21, 2013, 12:43:52 AM5/21/13
to Scala Melbourne
Hi,
I've been playing around with Scala's combinatorial parsers, and have
found them very powerful. I like them!

However I have struggled with one thing -- how to programatically
define some elements of the parser.
For eg, in the following (naive) command parser, I have hardcoded a
bunch of strings into a couple of places. How can I replace them with
a List[String] that I pass into the class during runtime?

Eg. to go from this:

def verb: Parser[Verb] = ("start" | "shutdown" | "reboot" |
"halt" | "reload") ^^ {
s => Verb(s)
}

To something like:

val possible_commands = List("start", "stop")
def verb: Parser[Verb] = (possible_commands) ^^ {
s => Verb(s)
}

I've read quite a few docs but haven't picked up what is required,
apart from one which seemed to be explaining that I needed to write a
function that accepts a character stream and returns matches and the
remainder of the stream.. which seems plausible except I became lost
before working out HOW to write it :)

I'll attach the full source of the demo-parser below.
-Toby
simple.scala

Ben Hutchison

unread,
May 21, 2013, 1:06:30 AM5/21/13
to scala...@googlegroups.com
Hi Toby,

I haven't got an opportunity to try a solution out in repl, but keep
in mind that combinator parser operators like '|' are just regular
methods invoked on the left operand, albeit with a funny name. So
these are equivalent:

"start" | "shutdown" | "reboot"

"start".|("shutdown").|("reboot")

So if you wanted to pass a collection of those command strings,
instead of hardcoding, I think you want to reduce() over them with the
'|' operator.

-Ben
> --
> You received this message because you are subscribed to the Google Groups "Melbourne Scala User Group" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to scala-melb+...@googlegroups.com.
> To post to this group, send an email to scala...@googlegroups.com.
> Visit this group at http://groups.google.com/group/scala-melb?hl=en-GB.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

Andrew Conway

unread,
May 21, 2013, 1:55:58 AM5/21/13
to scala...@googlegroups.com

As well as the reduce that Ben mentioned, you need to apply the "literal" function which is implicitly called in the original source. The following compiles (though I haven't run it). You don't need all the intermediate variables - I have just added them so that you can see the intermediate types.

case class Verb(s:String)
class TestParse extends RegexParsers {
    def verbOld: Parser[Verb] = ("start" | "shutdown" | "reboot" |

        "halt" | "reload") ^^ {
             s => Verb(s)
         }

     val possible_commands = List("start", "stop")
     val fred : Parser[String] = "fred" // uses implicit function literal
     val possible_commands_parsers : List[Parser[String]] = possible_commands.map{literal(_)}
     val possible_commands_parser : Parser[String] = possible_commands_parsers.reduce(_ | _)
     def verb: Parser[Verb] = possible_commands_parser ^^ { s => Verb(s) }

}


I really like the parser combinator library. Most java code I convert to scala has a factor of 2-3 line reduction; parsing code using one of the yacc type libraries goes down by a factor of 10 or so (but is much slower).

Toby Corkindale

unread,
May 21, 2013, 2:28:49 AM5/21/13
to Scala Melbourne
Thank you, Andrew and Ben.
The method using map{literal(_)}.reduce(_ | ) works nicely.

By the way, is anyone else going to the YOW JVM/Clojure night tonight?
I'll be there.

Cheers,
Toby
> --
> You received this message because you are subscribed to the Google Groups
> "Melbourne Scala User Group" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to scala-melb+...@googlegroups.com.
> To post to this group, send an email to scala...@googlegroups.com.
> Visit this group at http://groups.google.com/group/scala-melb?hl=en-GB.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>



--
Turning and turning in the widening gyre
The falcon cannot hear the falconer
Things fall apart; the center cannot hold
Mere anarchy is loosed upon the world

Ben Hutchison

unread,
May 21, 2013, 3:18:55 AM5/21/13
to scala...@googlegroups.com
Hi Toby, Andrew,

A small note in scala style. () enclose a single expression, {}
enclose a multipart expression, ie something with multiple lines or
semicolons. So IMO it's slightly clearer to write it:
map(literal(_)).reduce(_ | )

-Ben

Andrew Conway

unread,
May 21, 2013, 10:49:00 PM5/21/13
to scala...@googlegroups.com
Interesting. I hadn't heard (or registered) that style rule.

It's too late for me though - I have adopted a different style rule that has advantages and disadvantages. I often end up with quite long lines in collection manipulation with lots of brackets that can be hard to match. I put arguments that are functions inside {} rather than (). Then I have different brackets interspersed, which are easier (for me) to match when reading. Of course, it doesn't always work - sometimes I need curly braces for other things, and it doesn't work as nicely for functions that take multiple methods, although the latter is rare.

Incidentally, a while ago in one of your talks you showed some unicode character in a scala source file. Up till then I hadn't realised it was possible. Since then, I have made use of the large number of different unicode bracket types when writing toString methods, again for ease in bracket matching when toString produces complex output. It was particularly useful when debugging a parser that had lots of nested brackets in the parsed representation of a formula.

I also hadn't really registered that you could do .reduce(_ | )  instead of .reduce(_ | _)

Toby Corkindale

unread,
May 21, 2013, 11:09:01 PM5/21/13
to Scala Melbourne
I've tended to use Ben's suggestion in my own work; I'm not sure where
I picked it up from, but it feels "right" to me.
Although possibly that's just because of similarities with other
languages I use that use curly braces to denote code blocks too.

As far as style and readability goes, I'm a strong believe in breaking
up long lines.
If your line is so long and complex that you needed to break out the
unicode characters, then I'd suggest a better alternative would be to
switch to newlines between operations instead. Also, replace the
internal mini-blocks with the name of a lexical function created
specifically for it and placed nearby.

For example, before:

def thingy = foo.map(a => a*f(a)).map("widget_" +
_).map(_.toChars.length).reduce(_+)

Could be changed to something like:

def fa(a: Thing) = a*f(a)
def thingy =
foo map(fa)
map("widget_" + _)
map(_.toChars.length)
reduce(_+)

(Although maybe you need to leave trailing dots in there to cope with
the scala compiler's end-of-statement heuristic)

I don't know how that goes down with everyone else's idea of style
though.. It does feel far more readable to me though, especially as
the length of the original statement increases.

Ben Hutchison

unread,
May 21, 2013, 11:32:40 PM5/21/13
to scala...@googlegroups.com
BTW the () vs {} thing is more than a stylistic convention, its syntax.

() can enclose one expression. {} can enclose one or more semi-colon
delimited* expressions, with the final expression being the value of
the whole block.

* Yes, expressions in Scala code are all separated by semi-colons. But
due to semi-colon inference on line break, you normally don't see them
;)

The advantage of Andrew's style (ie use {} for any function literal)
is that you can add/remove expressions from the function body without
having to re-work the punctuation. I don't use the style myself, but
that is a definite benefit (note to self: maybe I should?).

-Ben

-Ben

Andrew Conway

unread,
May 21, 2013, 11:51:31 PM5/21/13
to scala...@googlegroups.com


On Wednesday, 22 May 2013 13:09:01 UTC+10, Toby Corkindale wrote:

If your line is so long and complex that you needed to break out the
unicode characters, then I'd suggest a better alternative would be to
switch to newlines between operations instead.

The uncode brackets were just for the output of toString methods for debugging, not for code that I was actually writing, and used when printing out tree like structures.

For code that I was writing, certainly I would break up anything messy enough to warrant more than two bracket styles. But there are trade offs here too. If you make a messy inline function an explicit function, then when reading it you don't get the instant information that it is only used in one line, which is useful. You can deal with this (and I sometimes do) by inserting blocks of code in gratuitous {}

e.g.
  val w1 = ...
  val w2 = f(messy expr 1,messy expr 2)
  val w3 = ...
becomes
  val w1 = ...
  val me1 = messy expr 1
  val me2 = messy expr 2
  val w2 = f(me1,me2)
  val w3 = ...
which pollutes the name space with me1 and me2 (and possibly delays GC of them) which can be fixed via
  val w1 = ...
  val w2 = {
    val me1 = messy expr 1
    val me2 = messy expr 2
    f(me1,me2)
  }
  val w3
but now you have introduced 4 new lines, one new lexical scope, two new names, all of which slightly reduce the readability... so it is a hard trade off, particularly if the main idea of the algorithm function is not the computation of w2 - it makes w2 seem more significant which can confuse the reader.

Incidentally, I often do a similar thing with expressions that have side effects (putting it inside some lexical scope - e.g.

  ...
  println(messyexp)
  ...
->
  ...
  val _ = {
     val msg = messyexp  // generally only worth doing if this actually becomes multiple lines
     println(msg)
  }
  ...
 
 Now, the "val _ =" is I believe needed to prevent Scala from thinking that the code is not applied to the previous line (eg. if the previous line ended with a constructor - this gives a bizarre error message), but it is still somewhat of a hack. Does anyone have better ideas? (of course you could make a new function, but that is even more obtrusive).

I think deciding when to split a line is one of the big challenges in producing readable programs. One indication of the problem is the popularity of "chaining" in jQuery (e.g. selector.f1().f2().f3() where f1, f2 and f3 are all identity functions with a side effect - apart from its conciseness, it is a dreadful style IMO since it superficially appears semantically different to selector.f1(); selector.f2(); selector.f3())

Reply all
Reply to author
Forward
0 new messages