YAML grammar - parameterized production rules ...

mwno...@gmail.com

unread,

May 24, 2008, 8:04:42 AM5/24/08

to KawaDD Development

I've been wanting to build a YAML parser in JavaCC for a little while
now ... but I get blocked
by the fact that some of the production rules are parameterized:
(http://yaml.org/spec/cvs/current.html)
I didn't get much help on the JavaCC mailing list.

A 'normal' production rule:

[63] c-ns-esc-char ::= "\"
( ns-esc-null | ns-esc-bell
| ns-esc-backspace
| ns-esc-horizontal-tab | ns-
esc-vertical-tab
| ns-esc-line-feed | ns-esc-
form-feed
| ns-esc-carriage-return |
ns-esc-escape | ns-esc-space
| ns-esc-double-quote | ns-
esc-slash | ns-esc-backslash
| ns-esc-next-line | ns-esc-
non-breaking-space
| ns-esc-line-separator | ns-
esc-paragraph-separator
| ns-esc-8-bit | ns-esc-16-
bit | ns-esc-32-bit )

A 'parameterized' production rule - In YAML block styles, structure is
determined by indentation. In general, indentation is defined as a
zero or more space characters at the start of a line. The amount of
indentation is
a presentation detail and must not be used to convey content
information.
[64] s-indent(n) ::= s-space × n
A block style construct is terminated when encountering a line which
is less indented than the construct. The productions use the notation
"s-indent(<n)" and "s-indent(≤n)" to express this.
[65] s-indent(<n) ::= s-space × m /* Where m < n */
[66] s-indent(≤n) ::= s-space × m /* Where m ≤ n */

Now, I didn't want to inject myself into the 'flame war' that caused
JavaCC to fork to become
{J++}awa{CC++} == KawaDD, but now that it has happened, I am
wondering if it might be possible
to use the KawaDD branch to experiment with changes to the underlying
LL(k) parse technology so
that productions like [64], [65] and [66] could be handled.

Other 'strange' YAML productions:

- subtraction
[35] ns-char ::= nb-char - s-white

- multiplication:
[65] s-indent(<n) ::= s-space × m /* Where m < n */
[66] s-indent(≤n) ::= s-space × m /* Where m ≤ n */

- switch:
[68] s-line-prefix(n,c) ::= c = block-out => s-block-line-prefix(n)
c = block-in => s-block-line-
prefix(n)
c = flow-out => s-flow-line-
prefix(n)
c = flow-in => s-flow-line-
prefix(n)

Any help would be appreciated,

Mike Norman

Jonathan Revusky

unread,

May 24, 2008, 10:51:25 AM5/24/08

to kawadd...@googlegroups.com

2008/5/24 mwno...@gmail.com <mwno...@gmail.com>:

Hi Mike,

Thanks for joining the list. You're the first person here besides me! :-)

Normally, yes, though there is no law against whitespace being
meaningful. I guess YAML is very influenced by Python.

Offhand, here is what I might do to attack the problem.

I would use the COMMON_TOKEN_ACTION=true option so that my
CommonTokenAction(Token t) method was called after every token was
recognized. And in there, I would massage the token stream that goes
to the parser. Your lexer needs to track the indent/dedent
information.

One trick that does not seem documented that I have often used to
great benefit s to create a phony lexical state. Supppose you create a
phony lexical state as follows. It's phony because the lexer will
never be switched into that state. However, you can still use the
token types you define in there.

<PHONY> TOKEN : {
<OPEN_BLOCK : "{">
|
<CLOSE_BLOCK : "}"
}

So basically your CommonTokenAction method is a hook into the
machinery that tracks the indent/dedent stuff. To track the
indent/dedent levels you use the
beginLine/beginColumn/endColumn/endLine fields in the token. So your
CommonTokenAction has logic a bit like this:

void CommonTokenAction(Token tok) {
......
if (weHaveAnIndent()) {
tok.kind = OPEN_BLOCK;
input_stream.backup(tok.image.length())
}
.....
}

You see what I mean by the code above? You have some logic that
identifies whether this is an indent that has to be dealt with and if
so, it changes the tok.kind field to OPEN_BLOCK and that's the token
type the parser actually gets. Of course, then you need to back up the
input_stream so that the real token gets fed through on the next
iteration, right? And you pull the same trick with a dedent and change
the tok.kind to CLOSE_BLOCK.

Then in your parser actions, you just use:

<OPEN_BLOCK>
blah blah blah
</CLOSE_BLOCK>

as if you had a "normal" language that uses { and } to open and close
scope, right? Basically, this approach of massaging the token stream
is that you convert a problem you don't quite know how to solve into a
problem that you do know how to solve.

> [64] s-indent(n) ::= s-space × n
> A block style construct is terminated when encountering a line which
> is less indented than the construct. The productions use the notation
> "s-indent(<n)" and "s-indent(≤n)" to express this.
> [65] s-indent(<n) ::= s-space × m /* Where m < n */
> [66] s-indent(≤n) ::= s-space × m /* Where m ≤ n */

Yes, this is based on python syntax. Actually, it occurs to me that
the jython project, Python implemented in Java, uses JavaCC, so you
could look at their grammar. Actually, they don't take the approach I
outlined above. But they do make extensive use of lexical states. I
attached the python.jjt file since I saw that I had it lying around.
It might be quite instructive to see how the Jython people handled
this stuff using JavaCC.

>
> Now, I didn't want to inject myself into the 'flame war' that caused
> JavaCC to fork to become
> {J++}awa{CC++} == KawaDD, but now that it has happened, I am
> wondering if it might be possible
> to use the KawaDD branch to experiment with changes to the underlying
> LL(k) parse technology so
> that productions like [64], [65] and [66] could be handled.

Well, first of all, I can't resist pointing out that the "flame war"
is not based on any legitimate differences with them regarding a road
map for future development. It is simply that they have no road map
whatsoever.

I don't have the sense that there is really a need to change the LL(k)
parse technology to handle this case. I get the feeling it is best
handled at the lexer tokenization level. One approach that should work
IMO is that you massage the indent/dedent offsets into a
OPEN_BLOCK/CLOSE_BLOCK tokens to be seen at the parser level. Once you
handle the indent/dedent part of the grammar, I would guess that most
of the other constructs won't be such a big deal, since they are
typical of any number of programming languages.

So maybe the thing to do is just define an absurdly trivial grammar
that parses some language consisting of an arbitrary amount of xxxx
with indents and dedents interspersed and turns that into xxxx with
the right { and } phony tokens interpsersed. Once you have that, you
ought to be able to attack the rest of the grammar as if it was a
normal programming language that used { and } to delimit blocks.

In any case, I am still really in the midst of a huge refactoring of
the codebase and can only envisage new features after that is
basically completed.

JR

python.jjt

mwno...@gmail.com

unread,

May 24, 2008, 2:11:22 PM5/24/08

to KawaDD Development

[MWN - some lines deleted, edited ...]

On May 24, 10:51 am, "Jonathan Revusky" <revu...@gmail.com> wrote:
> 2008/5/24 mwnor...@gmail.com <mwnor...@gmail.com>:

>
> Hi Mike,
>
> Thanks for joining the list. You're the first person here besides me! :-)
>

Woo-hoo!

I think I get it - some of the 'weirdness' of the productions can be
handled
by massaging the token stream into 'phoney' states that have the same
semantic as the original production. I'll have to think *hard* about
this ...

>
> > Now, I didn't want to inject myself into the 'flame war'
>

> Well, first of all, I can't resist pointing out that the "flame war" is
> not based on any legitimate differences

flame == heat - and heat was generated!

My interest is both technical and sentimental - Sreeni and I both
worked at WebGain (not that our jobs overlapped), so I have a 'soft
spot'
for JavaCC vs. other parsers. I also like not having to rely on a
runtime (unlike ANTLR).

I guess now that I'm #2 on the mailing list, I have a new 'soft spot'
-
hopefully not in my head ;-)

> I don't have the sense that there is really a need to change the LL(k)
> parse technology to handle this case. I get the feeling it is best
> handled at the lexer tokenization level.
>

> So maybe the thing to do is just define an absurdly trivial grammar
> that parses some language consisting of an arbitrary amount of xxxx
> with indents and dedents interspersed and turns that into xxxx with
> the right { and } phony tokens interpsersed. Once you have that, you
> ought to be able to attack the rest of the grammar as if it was a
> normal programming language that used { and } to delimit blocks.
>
> In any case, I am still really in the midst of a huge refactoring of
> the codebase and can only envisage new features after that is
> basically completed.
>
> JR

Fair enough.

Jonathan Revusky

unread,

May 24, 2008, 6:46:56 PM5/24/08

to kawadd...@googlegroups.com

On Sat, May 24, 2008 at 8:11 PM, mwno...@gmail.com <mwno...@gmail.com> wrote:
>
> [MWN - some lines deleted, edited ...]
>
> On May 24, 10:51 am, "Jonathan Revusky" <revu...@gmail.com> wrote:
>> 2008/5/24 mwnor...@gmail.com <mwnor...@gmail.com>:
>>
>> Hi Mike,
>>
>> Thanks for joining the list. You're the first person here besides me! :-)
>>

Okay, I was thinking about this and I just wanted to see how hard it
was to attack this the way I proposed.

You know, I suggested that you write a little trivial grammar that
just had one kind of token and tracked the indent/dedent stuff. It was
a little harder than I thought but I think it might be a good
approach. Though maybe it's not the only way to solve this.

The attached grammar really only has one kind of token, which is
<WORD>. There is also a phony lexical state called phony that defines
two kinds of tokens INDENT and DEDENT.

So, we have a dead simple grammar two productions, the root and a
block. Basically, the input is valid if the indents and dedents
balance. So, you know.

foo
bar
baz

is okay and so is:

foo
bar
baz

but:

foo
bar
baz

is invalid obviously because the indents and dedents don't balance.
This is implemented via the CommonTokenAction mechanism.

So, I hope it's useful to you. Not that this sets an absolute
precedent for how helpful I'll be on the list in the future. But since
you're subscriber number 1, you get the early bird special, I guess.

This whole technique her is kind of ad hoc. Maybe there is some way of
formalizing this token-massaging sort of approach. I'll have to think
about it. Meanwhile, this ought to get you going in the right
direction for now.

Cheers,

JR

IndentDedent.jj

Reply all

Reply to author

Forward