Hi Mike,
Thanks for joining the list. You're the first person here besides me! :-)
Normally, yes, though there is no law against whitespace being
meaningful. I guess YAML is very influenced by Python.
Offhand, here is what I might do to attack the problem.
I would use the COMMON_TOKEN_ACTION=true option so that my
CommonTokenAction(Token t) method was called after every token was
recognized. And in there, I would massage the token stream that goes
to the parser. Your lexer needs to track the indent/dedent
information.
One trick that does not seem documented that I have often used to
great benefit s to create a phony lexical state. Supppose you create a
phony lexical state as follows. It's phony because the lexer will
never be switched into that state. However, you can still use the
token types you define in there.
<PHONY> TOKEN : {
<OPEN_BLOCK : "{">
|
<CLOSE_BLOCK : "}"
}
So basically your CommonTokenAction method is a hook into the
machinery that tracks the indent/dedent stuff. To track the
indent/dedent levels you use the
beginLine/beginColumn/endColumn/endLine fields in the token. So your
CommonTokenAction has logic a bit like this:
void CommonTokenAction(Token tok) {
......
if (weHaveAnIndent()) {
tok.kind = OPEN_BLOCK;
input_stream.backup(tok.image.length())
}
.....
}
You see what I mean by the code above? You have some logic that
identifies whether this is an indent that has to be dealt with and if
so, it changes the tok.kind field to OPEN_BLOCK and that's the token
type the parser actually gets. Of course, then you need to back up the
input_stream so that the real token gets fed through on the next
iteration, right? And you pull the same trick with a dedent and change
the tok.kind to CLOSE_BLOCK.
Then in your parser actions, you just use:
<OPEN_BLOCK>
blah blah blah
</CLOSE_BLOCK>
as if you had a "normal" language that uses { and } to open and close
scope, right? Basically, this approach of massaging the token stream
is that you convert a problem you don't quite know how to solve into a
problem that you do know how to solve.
> [64] s-indent(n) ::= s-space × n
> A block style construct is terminated when encountering a line which
> is less indented than the construct. The productions use the notation
> "s-indent(<n)" and "s-indent(≤n)" to express this.
> [65] s-indent(<n) ::= s-space × m /* Where m < n */
> [66] s-indent(≤n) ::= s-space × m /* Where m ≤ n */
Yes, this is based on python syntax. Actually, it occurs to me that
the jython project, Python implemented in Java, uses JavaCC, so you
could look at their grammar. Actually, they don't take the approach I
outlined above. But they do make extensive use of lexical states. I
attached the python.jjt file since I saw that I had it lying around.
It might be quite instructive to see how the Jython people handled
this stuff using JavaCC.
>
> Now, I didn't want to inject myself into the 'flame war' that caused
> JavaCC to fork to become
> {J++}awa{CC++} == KawaDD, but now that it has happened, I am
> wondering if it might be possible
> to use the KawaDD branch to experiment with changes to the underlying
> LL(k) parse technology so
> that productions like [64], [65] and [66] could be handled.
Well, first of all, I can't resist pointing out that the "flame war"
is not based on any legitimate differences with them regarding a road
map for future development. It is simply that they have no road map
whatsoever.
I don't have the sense that there is really a need to change the LL(k)
parse technology to handle this case. I get the feeling it is best
handled at the lexer tokenization level. One approach that should work
IMO is that you massage the indent/dedent offsets into a
OPEN_BLOCK/CLOSE_BLOCK tokens to be seen at the parser level. Once you
handle the indent/dedent part of the grammar, I would guess that most
of the other constructs won't be such a big deal, since they are
typical of any number of programming languages.
So maybe the thing to do is just define an absurdly trivial grammar
that parses some language consisting of an arbitrary amount of xxxx
with indents and dedents interspersed and turns that into xxxx with
the right { and } phony tokens interpsersed. Once you have that, you
ought to be able to attack the rest of the grammar as if it was a
normal programming language that used { and } to delimit blocks.
In any case, I am still really in the midst of a huge refactoring of
the codebase and can only envisage new features after that is
basically completed.
JR
Okay, I was thinking about this and I just wanted to see how hard it
was to attack this the way I proposed.
You know, I suggested that you write a little trivial grammar that
just had one kind of token and tracked the indent/dedent stuff. It was
a little harder than I thought but I think it might be a good
approach. Though maybe it's not the only way to solve this.
The attached grammar really only has one kind of token, which is
<WORD>. There is also a phony lexical state called phony that defines
two kinds of tokens INDENT and DEDENT.
So, we have a dead simple grammar two productions, the root and a
block. Basically, the input is valid if the indents and dedents
balance. So, you know.
foo
bar
baz
is okay and so is:
foo
bar
baz
but:
foo
bar
baz
is invalid obviously because the indents and dedents don't balance.
This is implemented via the CommonTokenAction mechanism.
So, I hope it's useful to you. Not that this sets an absolute
precedent for how helpful I'll be on the list in the future. But since
you're subscriber number 1, you get the early bird special, I guess.
This whole technique her is kind of ad hoc. Maybe there is some way of
formalizing this token-massaging sort of approach. I'll have to think
about it. Meanwhile, this ought to get you going in the right
direction for now.
Cheers,
JR