Position pairs (start, end) in parser systems.

Rocky Bernstein

unread,

Jun 22, 2010, 6:37:03 AM6/22/10

to nyc-co...@googlegroups.com

Undaunted with the lack of enthusiasm (so far) for the last post let me try one other topic....

One thing that's I've noticed is that most or perhaps the more popular parser systems like Bison, Yacc don't offer much in the way of supporting position pairs for gramma rules -- that is a start location and an end location. I don't know that much about LLVM, but I believe this is the one exception. And this is why is say why MacRuby or Apple compilers are better with respect to associating a range of text with some sort of problem, which could either be a syntactic error or a run-time error. For example say in

a = 0, b = 20; x = b / a ; d = 40

^^^^^ error is here

or where we have:

x =

b

/

a;

which spans several lines.

Adding position pairs in a parser system that doesn't have this built in doesn't strike me as all that difficult to add. There generally is a variable which has the ending "line number" (but byte offset is probably cooler as is "line number + column number" pair.

Individual tokens or terminal symbols generally don't span a line and the ones that do can probably be counted on one hand if not one finger. So given an ending position, on a shift in a LR or bottom-up type system one can easily compute the beginning position for a terminal symbol and store that information as part of the shift action. Then, on a reduce action we have the starting position which can be obtained from with the start position of the first/least recent symbol (terminal or nonterminal) for the reduce action and ending position which can be obtained from the last or top-most/most recent stack symbol (terminal or nonterminal) on the stack.

So it is just a matter of maintaining for a symbol these two syntax-directed attributes and this could be boilerplate code. For LL or top-down parser systems something analogous I believe applies as well.

Anyone have experience with parser systems that save position pairs outside of LLVM? Care to comment of the prevalence or lack thereof?

Kalani Thielen

unread,

Jun 22, 2010, 10:31:46 AM6/22/10

to nyc-co...@googlegroups.com

I've done this with Happy (a YACC-like LALR parser generator for Haskell), although it doesn't involve the parser generator per se but rather the type of monad that your lexer/parser composition works in.

I've written an LALR parser generator (based on the relational lookahead-determination technique that Bison uses) that tracked this as well, but I prefer the Happy approach of making it a matter of the monad you decide rather than as a feature of the generator.

Charlie Robbins

unread,

Jun 22, 2010, 1:15:12 PM6/22/10

to nyc-co...@googlegroups.com

Kalani,

I missed your talk @ Lab49 a couple months ago. Is the source code / slides for the talk open source / available? I'm sure there are a bunch of folks who would like to give that a look over.

Cheers,

Charlie

Kalani Thielen

unread,

Jun 22, 2010, 9:14:02 PM6/22/10

to nyc-co...@googlegroups.com

Hey Charlie,

The compiler from that talk was written in Haskell and it compiles a language similar to the simply-typed lambda calculus (extended with recursive types, existential types, variants, records, and general recursion). The compiler shows how to transform STLC programs step-by-step into valid assembly code, and then package it up in a Windows or Linux executable file.

The STLC language itself would be a little tedious to program in, but it's a reasonable target for a typical compiler front-end (type inference, overloading resolution, polymorphic types, etc can be translated to STLC).

If you're interested, I've put up the source for the compiler here (if you have GHC installed, you can just build the compiler with the 'compile.bat' script there):

http://two5.org/~kthielen/stlc.zip

I've also put up an example of one of the output options of the compiler (it generates control-flow graph and register allocation diagrams as an aid to debugging the various phases of the compiler):

http://two5.org/~kthielen/diag/

I also put up the PowerPoint presentation if you're interested:

http://two5.org/~kthielen/stlcc.pptx

I'd be interested in discussing it and/or hearing comments/criticism if anybody's interested.

-Kalani

PS: I didn't implement a parser generator just for that compiler, but I did write up an implementation of the relational LALR-lookahead approach a while back:

http://blog.lab49.com/archives/2471

Rocky Bernstein

unread,

Jun 23, 2010, 3:13:47 AM6/23/10

to nyc-co...@googlegroups.com

On Tue, Jun 22, 2010 at 10:31 AM, Kalani Thielen <kthi...@gmail.com> wrote:

I've done this with Happy (a YACC-like LALR parser generator for Haskell), although it doesn't involve the parser generator per se but rather the type of monad that your lexer/parser composition works in.

When you say you have "done this", does what exactly does this mean? that the current stlc compiler does this, or that this is common practice, or that you've written a compiler at one time that does this?

I've looked at the slides and the zip file and see no mention of positions anywhere. I do see monads for line numbers mentioned in

http://www.haskell.org/happy/doc/html/sec-monads.html#sec-line-numbers

However that seems to be a single line number. Best of course would be [start column, start line] and [end column, end line] as suggested in the output when I gave

        a / b
        ^^^^

In stlc.y what I see related to errors is this:

lexer (c:cs)
      | isSpace   c = lexer cs
      | isDigit   c = lexNum (c:cs)
      | isSymChar c = lexSym (c:cs)
      | otherwise   = error $ "Unexpected character: '" ++ show c ++ "'."

parseError _ = error "Parse error"

parseError :: [Token] -> a
parseError _ = error "Parse error"

Is there someplace else I should be looking?

Given my vague understanding -- and it is vague -- I could be very wrong.

It sounds like what you are saying is that you have a parser system and a methodology (of using monads attached to tokens) whereby if one wants to capture position information that's possible if you want to code that way.

If that's the case, that is a little different than some sort of packaged setup where this position information code is already bundled or templated and all I have to do is somehow indicate I want it or not, even if by including a particular file that defines the monads for positions.

Kalani Thielen

unread,

Jun 23, 2010, 1:21:53 PM6/23/10

to nyc-co...@googlegroups.com

Both that I've written an LALR parser generator that produces parsers which track extents, and that I've used a monadic lexer/parser in Happy to do the same thing. As for the Happy example, it's true that the example only includes final positions, but because there's a clear point at which the lexer introduces tokens, getting initial positions is trivial as well.

The STLC compiler I mentioned earlier doesn't do this, as the parsing phase was bolted on to get demo programs in (since the language itself is more useful as a target for a fully-featured front end). There are a ton of other things I'd add into a reasonable front-end as well.

Kalani Thielen

unread,

Jul 21, 2010, 10:32:24 PM7/21/10

to nyc-co...@googlegroups.com

Hi guys,

It's been about a month since this thread was last active, but I thought that it might be worthwhile to demonstrate how this kind of feature can be added to an LALR parser in Haskell using a basic state/error monad and basic facts about reduction order in LALR parsers.

I've attached three files -- the real work is done in 'ReadPosTracking.hs' (this defines the monadic machinery needed to make position tracking simple to add to a parser), while 'Reader.y' defines a simple grammar for arithmetic expressions with a couple of very small changes to inject file start/end position information, and finally 'Term.hs' defines the basic term language of arithmetic expressions that can be used for evaluation or to print reduction order using the injected extent information.

If you have the Haskell Platform installed, you can try out the demo by putting the three files in a directory together, and then generating the arithmetic parser at your shell:

> happy Reader.y

Once that's done, you can just load the trifecta up in GHCi at your shell:

> ghci Reader.hs

GHCi, version 6.12.1: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
[1 of 3] Compiling Term ( Term.hs, interpreted )
[2 of 3] Compiling ReadPosTracking ( ReadPosTracking.hs, interpreted )
[3 of 3] Compiling Reader ( Reader.hs, interpreted )
Ok, modules loaded: Reader, ReadPosTracking, Term.
*Reader>

Now you might try a simple task, like parsing and evaluating a string:

*Reader> eval (readCalc "(1+2+3)*(4+5+6)")
90

Similarly, you can print a diagram of the reduction order of the expression (which uses the file/extent information that's been injected into terms):

*Reader> printCalcTrace "(1+2+3)*(4+5+6)"
(1+2+3)*(4+5+6)
-
(1+2+3)*(4+5+6)
   -
(1+2+3)*(4+5+6)
---
(1+2+3)*(4+5+6)
     -
(1+2+3)*(4+5+6)
-----
(1+2+3)*(4+5+6)
         -
(1+2+3)*(4+5+6)
           -
(1+2+3)*(4+5+6)
         ---
(1+2+3)*(4+5+6)
             -
(1+2+3)*(4+5+6)
         -----
(1+2+3)*(4+5+6)
---------------

I hope that this is useful. In my opinion, the grammar and term files are pretty straightforward and plain-vanilla (as you'd want if you were to easily "merge in" file/extent tracking). Although the machinery of the file/extent tracking monad is a little involved, it's generic enough to reuse in any parser.

Regards,

-Kalani

Reader.y

ReadPosTracking.hs

Term.hs

Rocky Bernstein

unread,

Jul 21, 2010, 10:46:54 PM7/21/10

to nyc-co...@googlegroups.com

Thanks for putting this together the demo.

Reply all

Reply to author

Forward