Hi again,
I was chatting with a few other people here about this problem
(finding a good, generic, parsing mechanism so that an AST can be
formed for many different languages). It was pointed out that not
only is the off-side rule (syntactically important indenting) quite
common and non-context-free, but C++ can be a nightmare, particularly
with the pre-processor.
For the off-side rule, I had a few thoughts. One path was
essentially reinventing the indexed grammar <
http://en.wikipedia.org/
wiki/Indexed_grammar>. In its full generality that isn't efficient,
but it seems the right restricted form may be. I was viewing an
indexed grammar as a context-free grammar with backreferences, but
where the backreferences can also be passed as arguments to the non-
terminals. e.g.
Block(\indent) -> (\indent \w+) Statement(\1) (CR \1 Statement(\1))*
Statement(\indent) -> .... |
if expression : CR Block(\indent)
(Note that here Block(\indent) means a block indented by MORE THAN
\indent, where the Block nonterminal also consumes the space. In
contrast Statement(\indent) means a statement that follows an indent
of \indent, and the indenting is not part of the statement, but rather
is there in to be passed to any nested blocks.)
I don't know if this sort of back-referencing is easy to put into an
LL(*) parser. Full indexed grammars are hard, but this isn't a full
indexed grammar - it only allows back-references to appear as
arguments.
See also <
http://danielmattosroberts.com/earley/context-sensitive-
earley.pdf>.
Cheers,
Will :-}