Combinator parsers that use regexes for tokenization (lexical parsing), and then syntactic parsing based on the regexes?

63 views
Skip to first unread message

Ken McDonald

unread,
May 28, 2011, 8:33:12 PM5/28/11
to scala...@googlegroups.com
The Scala combinators library has a small example of a lexical parser; however, it doesn't use regexes, and it doesn't really show what to do with tokens once one has them. If anyone can point out examplex of simple to moderate combinator parsers that tokenize via regexes, and then parse based on the tokens, I'd appreciate it.

Thanks,
Ken McDonald

Randall R Schulz

unread,
May 28, 2011, 8:50:00 PM5/28/11
to scala...@googlegroups.com

I don't have actual code handy, but for a parser that extends
RegexParsers you can do something like this:


class MyREParser
extends RegexParsers
{
val number = "[0-9]+".r
val name = "[A-Z_a-z][A-Z_a-z0-9]".r

// Here you can use number and name as terminals
// in your combinator parser grammar.
}

By default, white-space will be skipped before attemting to match any
regular-expression terminal (token). You can control that as well as
what constitutes inter-token white-space.


The second edition of Programming In Scala has a RegexParser example in
its Combinator Parsers chapter. I believe the 1st ed. did, as well. And
if you don't know, the 1st. ed. is available free online in HTML [1].

[1] http://www.artima.com/pins1ed/


Randall Schulz

Daniel Sobral

unread,
May 28, 2011, 9:54:22 PM5/28/11
to Randall R Schulz, scala...@googlegroups.com
On Sat, May 28, 2011 at 21:50, Randall R Schulz <rsc...@sonic.net> wrote:
> On Saturday May 28 2011, Ken McDonald wrote:
>> The Scala combinators library has a small example of a lexical
>> parser; however, it doesn't use regexes, and it doesn't really show
>> what to do with tokens once one has them. If anyone can point out
>> examplex of simple to moderate combinator parsers that tokenize via
>> regexes, and then parse based on the tokens, I'd appreciate it.
>
> I don't have actual code handy, but for a parser that extends
> RegexParsers you can do something like this:
>
>
> class MyREParser
> extends RegexParsers

But that isn't a lexer.

--
Daniel C. Sobral

I travel to the future all the time.

Randall R Schulz

unread,
May 28, 2011, 10:24:20 PM5/28/11
to scala...@googlegroups.com

So?

It answers the question "what do do with tokens once one has them."
(Consume them in non-terminal productions, of course.) It clearly is a
(extremely minimal, in fact entirely incomplete) "... parser that
tokenize[s] via regexes..."


Randall Schulz

Ken McDonald

unread,
May 28, 2011, 10:47:28 PM5/28/11
to scala...@googlegroups.com, Randall R Schulz
The CPs that are discussed in the "Programming in Scala" book (the ones that extend RegexParsers) effectively seem to be combining the lexing and parsing phases; this is what Randall did. I'm interested in seeing an example where the two phases are distinct, i.e. where the lexing simply produces a series of tokens, and the parser's productions operate on the tokens (as opposed to RegexParsers, where the productions operate directly on string input). I'd like to see what this looks like, and it might be better for what I'm trying.

Thanks,
Ken

Daniel Sobral

unread,
May 28, 2011, 10:54:26 PM5/28/11
to scala...@googlegroups.com, Randall R Schulz


See scala.utils.parsing.json, particularly Lexer and Parser. Looking
for subclasses on Scaladoc is often useful in such situations.

Randall R Schulz

unread,
May 28, 2011, 11:05:33 PM5/28/11
to scala...@googlegroups.com
On Saturday May 28 2011, Ken McDonald wrote:

Why?


> Thanks,
> Ken


Randall Schulz

Ken McDonald

unread,
May 28, 2011, 11:49:46 PM5/28/11
to scala...@googlegroups.com
Why not? I'm not being facetious--I'd like to see a different way of doing things and evaluate the pros and cons. To my knowledge, parsers have traditionally (or at least, often) been implemented with a tokenizing phase and then a parsing phase, and I'm assuming there are probably reasons for that.

Ken 

Sanjay Dasgupta

unread,
May 29, 2011, 1:14:56 AM5/29/11
to scala...@googlegroups.com, Randall R Schulz
Some time back I'd started work on a version of RegexParsers augmented with real lexical-analysis capabilities. A first version - that works the way you describe - can be found here LexingRegexParser.

The code below shows the changes required to port an existing parser based on StandardTokenParsers to the augmented RegexParsers framework. LexingRegexParsers actually requires all tokens (regex and literal, named or anonymous-inlined) to be defined or declared before use. In addition to overriding RegexParsers' literal(String) and regex(Regex) methods (to work with the embedded lexical-analyzer), it also defines n-ary versions: literal(String*) and regex(Regex*) that facilitate the definition or declaration of tokens (as in the code). 

You can find user-documentation in the wiki-pages at the site.

I hope this helps,

- Sanjay

// import scala.util.parsing.combinator.lexical.StdLexical
import lexpar.LexingRegexParsers
// import scala.util.parsing.combinator.syntactical._
// object OrderDSL extends StandardTokenParsers {
object OrderDSL extends LexingRegexParsers {
  // lexical.delimiters ++= List("(", ")", ",")
  // lexical.reserved += ("buy", "sell", "shares", "at", "max", "min", "for", "trading", "account")
  // Add token declarations / definitions
  literal("(", ")", ",")
  literal("buy", "sell", "shares", "at", "max", "min", "for", "trading", "account")
  val (stringLit, numericLit, ident) = regex("\\\"[^\"\n]*\\\"".r, "[0-9]+".r, "[a-zA-Z][a-zA-Z0-9]*".r)

  def instr = trans ~ account_spec
  def trans = "(" ~> repsep(trans_spec, ",") <~ ")"
  def trans_spec = buy_sell ~ buy_sell_instr
  def account_spec = "for" ~> "trading" ~> "account" ~> stringLit
  def buy_sell = ("buy" | "sell")
  def buy_sell_instr = security_spec ~ price_spec
  def security_spec = numericLit ~ ident ~ "shares"
  def price_spec = "at" ~ ("min" | "max") ~ numericLit
  
  def main(args: Array[String]) {
    import scala.util.parsing.input.CharSequenceReader
    val input = """ (buy 100 IBM shares at max 45, 
      sell 50 CISCO shares at min 25, 
      buy 100 Google shares at max 800) 
      for trading account "SSS1234" """
    println(phrase(instr)(new CharSequenceReader(input)))

Grzegorz Kossakowski

unread,
May 29, 2011, 4:59:03 AM5/29/11
to scala...@googlegroups.com
2011/5/29 Ken McDonald <ykke...@gmail.com>

The CPs that are discussed in the "Programming in Scala" book (the ones that extend RegexParsers) effectively seem to be combining the lexing and parsing phases; this is what Randall did. I'm interested in seeing an example where the two phases are distinct, i.e. where the lexing simply produces a series of tokens, and the parser's productions operate on the tokens (as opposed to RegexParsers, where the productions operate directly on string input). I'd like to see what this looks like, and it might be better for what I'm trying.

http://review.source.gogoego.com/gitweb?p=projects/jribble.git;a=blob;f=src/main/scala/com/google/jribble/Parsers.scala;h=2dfdf8e90e2b695b3cca3f7008a7a709d03860a3;hb=HEAD

--
Grzegorz Kossakowski

Ken McDonald

unread,
May 29, 2011, 11:31:06 AM5/29/11
to scala...@googlegroups.com
Ah, that's a very nice example--it should give me the info I'm looking for. Many thanks!

Ken

Reply all
Reply to author
Forward
0 new messages