I'm trying to define groups explicitly as per the documentation. Right now I am trying to define a non-noise line block.
This is my grammar:
"Start Symbol" = <Program>
NL = {CR}? {LF}
NL @= { Type = Noise }
Whitespace = ({Space}|{HT})+
LnBlk @= { Type = Content }
LnBlk Start = '--'
LnBlk End = NL
LnBlk Block @= { Ending = Open }
<Program> ::= LnBlk
As you can see, I deliberately don't use the names generated by GOLD for implicit comment group creation; neither the newline nor the group has a "special" name. The tables build fine, but I cannot parse my test line:
-- test
(Note that I do have a newline after the "test" line.) I would expect to get a LnBlk token.
Digging deeper, I analyzed the symbol table. This reveals something interesting: the NL symbol exists twice, once as "Noise"/"Defined in Grammar", and once as "Lexical Group End"/"Implicitly Defined". I assume that this is the cause of the problem. Note that I'm following the documented example for the "Pascal" block comments as described here:
http://goldparser.org/doc/grammars/example-group.htm
This points to a general issue with groups that do not consume their end token: the end token can be any terminal, and that terminal is supposed to remain usable elsewhere in the grammar. As such, the "Lexical Group End" type should not be used in this case.
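The expected behaviour can be sketched in a few lines of Python. This is a toy model and not GOLD's engine: the TERMINALS table, the GROUP dictionary and the tokenize function are hypothetical names made up for illustration. It shows an open-ended group that stops in front of its end terminal, so that the same NL terminal is then lexed normally and keeps its own type.

```python
import re

# Hypothetical terminal table: (name, pattern, kind). Not GOLD's data model.
TERMINALS = [
    ("NL", re.compile(r"\r?\n"), "Noise"),
    ("Whitespace", re.compile(r"[ \t]+"), "Noise"),
]

# Hypothetical group description; "end" names an existing terminal.
GROUP = {"name": "LnBlk", "start": "--", "end": "NL", "ending": "Open"}


def tokenize(text):
    """Return (name, lexeme) pairs, noise tokens included."""
    patterns = {name: pattern for name, pattern, _ in TERMINALS}
    end_pattern = patterns[GROUP["end"]]
    tokens = []
    pos = 0
    while pos < len(text):
        if text.startswith(GROUP["start"], pos):
            # Look for the end terminal after the start lexeme.
            match = end_pattern.search(text, pos + len(GROUP["start"]))
            if match is None:
                stop = len(text)          # group runs to end of input
            elif GROUP["ending"] == "Closed":
                stop = match.end()        # group consumes its end token
            else:
                stop = match.start()      # open: leave the end token in place
            tokens.append((GROUP["name"], text[pos:stop]))
            pos = stop
            continue
        for name, pattern, _ in TERMINALS:
            match = pattern.match(text, pos)
            if match:
                tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"unexpected character at offset {pos}")
    return tokens
```

In this model, the test line "-- test" followed by a newline produces a LnBlk token with the lexeme "-- test" and then an ordinary NL token, with only one NL symbol involved.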
In fact, even if the block consumes the end token it should not be a "Lexical Group End". Think of a grammar which allows arbitrary text blocks like so:
"Start Symbol" = <Statement>
End = 'end'
Begin = 'begin'
Statement = {AlphaNumeric}+
Message @= { Type = Content }
Message Start = 'message:'
Message End = End
Message Block @= { Advance = Character, Ending = Closed }
<Statement> ::= Statement
            | Message
            | <Block>
<StatementList> ::= <Statement> ';' <StatementList>
                |
<Block> ::= Begin <StatementList> End
It fails to generate the DFA states ("Cannot distinguish between: End End"). Okay, so let's not define "end" as a terminal and try again...
"Start Symbol" = <Statement>
Statement = {AlphaNumeric}+
Message @= { Type = Content }
Message Start = 'message:'
Message End = end
Message Block @= { Advance = Character, Ending = Closed }
<Statement> ::= Statement
            | Message
            | <Block>
<StatementList> ::= <Statement> ';' <StatementList>
                |
<Block> ::= begin <StatementList> end
Now it creates the tables all right. Inspecting the symbols, however, shows that "end" is a "Lexical Group End", even though it really should be "Content". The test now parses fine, though (yeeha!):
begin
statement;
message: funky stuff end;
end
So, in summary, I see the following two major issues:
- Explicitly defined terminals cannot be used as group start/end symbols as described in the example. Trying to do so causes the symbol table to contain duplicate names and leads to DFA table creation errors.
- The "Lexical Group End" type is meaningless and should be dropped completely, so that those symbols are normal terminals of type "Content" or "Noise".
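The second point can be made concrete with a small Python sketch. It is only a model of the suggestion and says nothing about how the GOLD Builder is actually implemented: Symbol, SymbolTable, define and mark_group_end are hypothetical names. The idea is that "ends a group" is a role attached to an existing terminal, not a separate symbol type, so a terminal is never registered twice.

```python
from dataclasses import dataclass, field


@dataclass
class Symbol:
    name: str
    kind: str = "Content"  # "Content" or "Noise"; no separate group-end type
    ends_groups: set = field(default_factory=set)  # groups this terminal closes


class SymbolTable:
    """Hypothetical symbol table keyed by name, so names stay distinct."""

    def __init__(self):
        self._symbols = {}

    def define(self, name, kind="Content"):
        """Create the symbol, or update the kind of an existing one."""
        symbol = self._symbols.setdefault(name, Symbol(name))
        symbol.kind = kind
        return symbol

    def mark_group_end(self, terminal, group):
        """Attach the group-end role to a terminal without duplicating it."""
        symbol = self._symbols.setdefault(terminal, Symbol(terminal))
        symbol.ends_groups.add(group)
        return symbol

    def names(self):
        return list(self._symbols)
```

With this model, defining NL as "Noise" and then marking it as the end of LnBlk leaves a single NL entry that is still "Noise", and an undeclared "end" used as the end of Message becomes a plain "Content" terminal.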
It would be great to have those fixed.
Thanks, Arsène