Combinator parsers that use regexes for tokenization (lexical parsing), and then syntactic parsing based on the regexes?

Ken McDonald

unread,

May 28, 2011, 8:33:12 PM5/28/11

to scala...@googlegroups.com

The Scala combinators library has a small example of a lexical parser; however, it doesn't use regexes, and it doesn't really show what to do with tokens once one has them. If anyone can point out examplex of simple to moderate combinator parsers that tokenize via regexes, and then parse based on the tokens, I'd appreciate it.

Thanks,

Ken McDonald

Randall R Schulz

unread,

May 28, 2011, 8:50:00 PM5/28/11

to scala...@googlegroups.com

I don't have actual code handy, but for a parser that extends
RegexParsers you can do something like this:

class MyREParser
extends RegexParsers
{
val number = "[0-9]+".r
val name = "[A-Z_a-z][A-Z_a-z0-9]".r

// Here you can use number and name as terminals
// in your combinator parser grammar.
}

By default, white-space will be skipped before attemting to match any
regular-expression terminal (token). You can control that as well as
what constitutes inter-token white-space.

The second edition of Programming In Scala has a RegexParser example in
its Combinator Parsers chapter. I believe the 1st ed. did, as well. And
if you don't know, the 1st. ed. is available free online in HTML [1].

[1] http://www.artima.com/pins1ed/

Randall Schulz

Daniel Sobral

unread,

May 28, 2011, 9:54:22 PM5/28/11

to Randall R Schulz, scala...@googlegroups.com

On Sat, May 28, 2011 at 21:50, Randall R Schulz <rsc...@sonic.net> wrote:
> On Saturday May 28 2011, Ken McDonald wrote:
>> The Scala combinators library has a small example of a lexical
>> parser; however, it doesn't use regexes, and it doesn't really show
>> what to do with tokens once one has them. If anyone can point out
>> examplex of simple to moderate combinator parsers that tokenize via
>> regexes, and then parse based on the tokens, I'd appreciate it.
>

> I don't have actual code handy, but for a parser that extends
> RegexParsers you can do something like this:
>
>
> class MyREParser
> extends RegexParsers

But that isn't a lexer.

--
Daniel C. Sobral

I travel to the future all the time.

Randall R Schulz

unread,

May 28, 2011, 10:24:20 PM5/28/11

to scala...@googlegroups.com

So?

It answers the question "what do do with tokens once one has them."
(Consume them in non-terminal productions, of course.) It clearly is a
(extremely minimal, in fact entirely incomplete) "... parser that
tokenize[s] via regexes..."

Randall Schulz

Ken McDonald

unread,

May 28, 2011, 10:47:28 PM5/28/11

to scala...@googlegroups.com, Randall R Schulz

The CPs that are discussed in the "Programming in Scala" book (the ones that extend RegexParsers) effectively seem to be combining the lexing and parsing phases; this is what Randall did. I'm interested in seeing an example where the two phases are distinct, i.e. where the lexing simply produces a series of tokens, and the parser's productions operate on the tokens (as opposed to RegexParsers, where the productions operate directly on string input). I'd like to see what this looks like, and it might be better for what I'm trying.

Thanks,

Ken

Daniel Sobral

unread,

May 28, 2011, 10:54:26 PM5/28/11

to scala...@googlegroups.com, Randall R Schulz

See scala.utils.parsing.json, particularly Lexer and Parser. Looking
for subclasses on Scaladoc is often useful in such situations.

Randall R Schulz

unread,

May 28, 2011, 11:05:33 PM5/28/11

to scala...@googlegroups.com

On Saturday May 28 2011, Ken McDonald wrote:

Why?

> Thanks,
> Ken

Randall Schulz

Ken McDonald

unread,

May 28, 2011, 11:49:46 PM5/28/11

to scala...@googlegroups.com

Why not? I'm not being facetious--I'd like to see a different way of doing things and evaluate the pros and cons. To my knowledge, parsers have traditionally (or at least, often) been implemented with a tokenizing phase and then a parsing phase, and I'm assuming there are probably reasons for that.

Ken

Sanjay Dasgupta

unread,

May 29, 2011, 1:14:56 AM5/29/11

to scala...@googlegroups.com, Randall R Schulz

Some time back I'd started work on a version of RegexParsers augmented with real lexical-analysis capabilities. A first version - that works the way you describe - can be found here LexingRegexParser.

The code below shows the changes required to port an existing parser based on StandardTokenParsers to the augmented RegexParsers framework. LexingRegexParsers actually requires all tokens (regex and literal, named or anonymous-inlined) to be defined or declared before use. In addition to overriding RegexParsers' literal(String) and regex(Regex) methods (to work with the embedded lexical-analyzer), it also defines n-ary versions: literal(String*) and regex(Regex*) that facilitate the definition or declaration of tokens (as in the code).

You can find user-documentation in the wiki-pages at the site.

I hope this helps,

- Sanjay

// import scala.util.parsing.combinator.lexical.StdLexical

import lexpar.LexingRegexParsers

// import scala.util.parsing.combinator.syntactical._

// object OrderDSL extends StandardTokenParsers {

object OrderDSL extends LexingRegexParsers {

// lexical.delimiters ++= List("(", ")", ",")

// lexical.reserved += ("buy", "sell", "shares", "at", "max", "min", "for", "trading", "account")

// Add token declarations / definitions

literal("(", ")", ",")

literal("buy", "sell", "shares", "at", "max", "min", "for", "trading", "account")

val (stringLit, numericLit, ident) = regex("\\\"[^\"\n]*\\\"".r, "[0-9]+".r, "[a-zA-Z][a-zA-Z0-9]*".r)

def instr = trans ~ account_spec

def trans = "(" ~> repsep(trans_spec, ",") <~ ")"

def trans_spec = buy_sell ~ buy_sell_instr

def account_spec = "for" ~> "trading" ~> "account" ~> stringLit

def buy_sell = ("buy" | "sell")

def buy_sell_instr = security_spec ~ price_spec

def security_spec = numericLit ~ ident ~ "shares"

def price_spec = "at" ~ ("min" | "max") ~ numericLit

def main(args: Array[String]) {

import scala.util.parsing.input.CharSequenceReader

val input = """ (buy 100 IBM shares at max 45,

sell 50 CISCO shares at min 25,

buy 100 Google shares at max 800)

for trading account "SSS1234" """

println(phrase(instr)(new CharSequenceReader(input)))

Grzegorz Kossakowski

unread,

May 29, 2011, 4:59:03 AM5/29/11

to scala...@googlegroups.com

2011/5/29 Ken McDonald <ykke...@gmail.com>

The CPs that are discussed in the "Programming in Scala" book (the ones that extend RegexParsers) effectively seem to be combining the lexing and parsing phases; this is what Randall did. I'm interested in seeing an example where the two phases are distinct, i.e. where the lexing simply produces a series of tokens, and the parser's productions operate on the tokens (as opposed to RegexParsers, where the productions operate directly on string input). I'd like to see what this looks like, and it might be better for what I'm trying.

http://review.source.gogoego.com/gitweb?p=projects/jribble.git;a=blob;f=src/main/scala/com/google/jribble/Parsers.scala;h=2dfdf8e90e2b695b3cca3f7008a7a709d03860a3;hb=HEAD

--
Grzegorz Kossakowski

Ken McDonald

unread,

May 29, 2011, 11:31:06 AM5/29/11

to scala...@googlegroups.com

Ah, that's a very nice example--it should give me the info I'm looking for. Many thanks!

Ken

Reply all

Reply to author

Forward