How to avoid lexical keywords


Michael Powell

Feb 14, 2019, 4:15:21 PM
to antlr-discussion
Hello,

I've got some test fixtures which generate things like randomized Identifiers or Identities ("ident"), but I am finding that they sometimes collide with lexical keywords. To the lexer's credit, those are being tokenized as keywords, but they then fail to match as "legitimate" identifiers.

Really, I think it is not unreasonable to want to avoid those keywords if at all possible.

So I wonder: is it possible for the lexer to furnish a list of these keywords? Is that even a thing?

I see in the C# ANTLR4 generated code that these are somewhat loosely identified by rule name, which only gets me part of the way. Maybe I could examine the literal names, but I have some doubts there as well.

Or am I left to manually collate the keywords and such?

Best regards,

Michael Powell

Kevin Cummings

Feb 14, 2019, 5:30:30 PM
to antlr-di...@googlegroups.com
I dealt with this by changing all references to IDENT in the parser to identifier.  Then I added a new parser rule:

identifier:
    IDENT
    | <list of reserved names you wish to treat as IDENTS>
    ;

This works like a champ for me in PCCTS, ANTLR-2, and ANTLR-3.  I haven’t rebuilt any of those parsers for ANTLR-4.


--
Kevin J. Cummings
Registered Linux User #1232


Michael Powell

Feb 14, 2019, 5:49:18 PM
to antlr-discussion


On Thursday, February 14, 2019 at 5:30:30 PM UTC-5, Kevin Cummings wrote:
 On Feb 14, 2019, at 16:15, Michael Powell <mwpow...@gmail.com> wrote:

Hello,

I've got some test fixtures which generate things like randomized Identifiers or Identities ("ident"), but I am finding that they sometimes collide with lexical keywords. To the lexer's credit, those are being tokenized as keywords, but they then fail to match as "legitimate" identifiers.

Really, I think it is not unreasonable to want to avoid those keywords if at all possible.

So I wonder: is it possible for the lexer to furnish a list of these keywords? Is that even a thing?

I see in the C# ANTLR4 generated code that these are somewhat loosely identified by rule name, which only gets me part of the way. Maybe I could examine the literal names, but I have some doubts there as well.

Or am I left to manually collate the keywords and such?

I dealt with this by changing all references to IDENT in the parser to identifier.  Then I added a new parser rule:

identifier:
    IDENT
    | <list of reserved names you wish to treat as IDENTS>
    ;

Well, actually, I want to preclude the reserved words from appearing in the idents, on the one hand...

On the other hand, maybe reserved words make sense in an identifier under some circumstances, in which case the above rule applies.

So, for the first case, I'm not sure of an easy way to tell the lexer/parser "this word is a keyword", apart from the crude sort of bookkeeping that happens while the lexer is being generated, that is, beyond the verbiage, punctuation, symbols, etc. already defined in my lexer rules.

Thanks for the feedback!

Mike Lischke

Feb 15, 2019, 2:57:35 AM
to antlr-discussion
I've got some test fixtures which generate things like randomized Identifiers or Identities ("ident"), but I am finding that they sometimes collide with lexical keywords. To the lexer's credit, those are being tokenized as keywords, but they then fail to match as "legitimate" identifiers.

Really, I think it is not unreasonable to want to avoid those keywords if at all possible.

So I wonder: is it possible for the lexer to furnish a list of these keywords? Is that even a thing?

The lexer has a vocabulary, which contains all lexer tokens. Is that what you are looking for?
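For the C# target, that vocabulary can be mined for exactly this. Tokens defined as a single string literal (the keywords, punctuation, 'inf', 'nan', ...) have a literal name, while IDENT and the numeric-literal rules do not. A rough sketch only; the helper name and the "starts with a letter" test for separating keywords from punctuation are my own, not part of the generated code:

using System.Collections.Generic;
using Antlr4.Runtime;

static class KeywordHelper
{
    public static IEnumerable<string> GetKeywordLiterals(Lexer lexer)
    {
        IVocabulary vocab = lexer.Vocabulary;
        for (int type = 1; type <= vocab.MaxTokenType; type++)
        {
            // Literal names come back quoted, e.g. "'enum'", and are null
            // for tokens without a fixed literal (IDENT, OCT_LIT, ...).
            string literal = vocab.GetLiteralName(type);
            if (literal == null)
                continue;

            string text = literal.Trim('\'');

            // Keep word-like literals only, skipping ';', '=', and the like.
            if (text.Length > 0 && char.IsLetter(text[0]))
                yield return text;
        }
    }
}

One caveat: a token only gets a literal name when its rule body is a single string literal, so keywords assembled from fragments (for case-insensitive matching, say) will not show up this way.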


Michael Powell

Feb 15, 2019, 1:38:54 PM
to antlr-discussion
Well, what is being generated for C# is not especially useful for what I want to accomplish. Here are some lexer snippets, for instance, intentionally stripped down a bit for the example:

public const int
CLOSE_SQUARE_BRACKET=14, COMMA=15, DEFAULT=16, DOT=17, ENUM=18, EOS=19, 
PUBLIC=37, REPEATED=38, REQUIRED=39, RESERVED=40, SIGN=41, SYNTAX=42, 
OCT_LIT=63, DEC_LIT=64, INFINITY=65, NOT_A_NUMBER=66, FLOAT_DIG_DOT_DIG_OPT_EXP=67, 
FLOAT_DIG_EXP=68, FLOAT_DOT_DIG_OPT_EXP=69, IDENT=70, GROUP_NAME=71;

public static readonly string[] ruleNames = {
"LET_DIG_UNDERSCORE", "OCT_DIG", "SIGNAGE", "UNDERSCORE", "X", "ZED", 
"OPEN_CURLY_BRACE", "OPEN_PAREN", "OPEN_SQUARE_BRACKET", "OPTION", "OPTIONAL", 
"UINT32", "UINT64", "BOOLEAN_FALSE", "BOOLEAN_TRUE", "HEX_LIT", "OCT_LIT", 
"FLOAT_DOT_DIG_OPT_EXP", "IDENT", "GROUP_NAME"
};

private static readonly string[] _LiteralNames = {
"';'", "'='", "'extend'", "'extensions'", "'field'", "'group'", "'import'", 
"'double'", "'fixed32'", "'fixed64'", "'float'", "'int32'", "'int64'", 
"'uint64'", "'false'", "'true'", null, null, null, "'inf'", "'nan'"
};

private static readonly string[] _SymbolicNames = {
"COMMA", "DEFAULT", "DOT", "ENUM", "EOS", "EQU", "EXTEND", "EXTENSIONS", 
"TO", "WEAK", "BOOL", "BYTES", "DOUBLE", "FIXED32", "FIXED64", "FLOAT", 
"FLOAT_DOT_DIG_OPT_EXP", "IDENT", "GROUP_NAME"
};

What's going on, understandably, is that the lexer rules themselves are codified. Fine, this is all well and good. However, I would consider the actual KEYWORDS to be only a subset of those rules.

The best I could come up with is to manually identify them:

// Wrapped in an iterator method so the yield statements have a home:
static IEnumerable<string> GetKeywords()
{
    const string @enum = nameof(@enum);
    const string extend = nameof(extend);
    const string extensions = nameof(extensions);
    const string @double = nameof(@double);
    const string @float = nameof(@float);

    yield return @enum;
    yield return extend;
    yield return extensions;
    yield return @double;
    yield return @float;
}

That yields the language-level keywords in a more usable form, which I can then gather into an enumerable keyword set.
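For the original fixture problem, that set can then be used to reject colliding random identifiers. A sketch only; GenerateRandomIdent stands in for whatever the fixture actually uses:

var keywords = new HashSet<string>(GetKeywords());

string ident;
do
{
    // GenerateRandomIdent is hypothetical; substitute the fixture's generator.
    ident = GenerateRandomIdent();
} while (keywords.Contains(ident));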

I just wondered if something like this wasn't already being done for the Antlr Lexer code generation, but it would seem it is not.

Geoff Groos

Mar 7, 2019, 2:54:16 PM
to antlr-discussion
I think you are correct, ANTLR doesn't have any specific magic for this case.

I believe you can still use semantic predicates even as far down as lexer rules (please confirm this; I'm not sure now that I've said it). With those, you could provide a custom LexerBase class that your lexer extends and have your semantic predicates call back into your code to tell ANTLR how to lex those characters.
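For what it's worth, a minimal illustration of that callback mechanism (the rule and method names are made up, not taken from this grammar): a lexer rule such as IDENT : [A-Za-z_] [A-Za-z0-9_]* { IsNotReserved() }? ; only matches when the predicate returns true, and since the C# target emits partial classes, the predicate method can live in a hand-written partial class rather than a custom base class:

using System.Collections.Generic;

// Hand-written half of the generated (partial) lexer; MyLexer and
// IsNotReserved are illustrative names only.
public partial class MyLexer
{
    private static readonly HashSet<string> Reserved =
        new HashSet<string> { "enum", "extend", "extensions", "double", "float" };

    // Called from the { IsNotReserved() }? predicate; Text holds the text
    // matched so far for the token being built.
    private bool IsNotReserved() => !Reserved.Contains(Text);
}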

I think a better approach is to make your lexer as simple as possible, letting it generate keyword tokens instead of identifier tokens, then add a special semantic predicate to the parser on a very lax rule like the one Mike suggested.
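A sketch of that shape, using token names from the generated code quoted earlier in the thread and a hypothetical KeywordAllowedHere() method supplied on the parser (for example in a partial class):

identifier
    : IDENT
    | { KeywordAllowedHere() }? keyword
    ;

keyword
    : ENUM | EXTEND | EXTENSIONS | DOUBLE | FLOAT
    ;

The predicate lets a keyword pass as an identifier only where your code decides that is safe, rather than everywhere.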

Hope that helps!

Michael Powell

Mar 7, 2019, 3:10:00 PM
to antlr-di...@googlegroups.com
On Thu, Mar 7, 2019 at 2:54 PM Geoff Groos <groo...@gmail.com> wrote:
>
> I think you are correct, ANTLR doesn't have any specific magic for this case.
>
> I believe you can still use semantic predicates even as far down as lexer rules (please confirm this; I'm not sure now that I've said it). With those, you could provide a custom LexerBase class that your lexer extends and have your semantic predicates call back into your code to tell ANTLR how to lex those characters.

Are there examples of this? I'm not quite sure what would be generated
into the target language lexer.

> I think a better approach is to make your lexer as simple as possible, letting it generate keyword tokens instead of identifier tokens, then add a special semantic predicate to the parser on a very lax rule like the one Mike suggested.

Again, what I am after is a usable set of actual keywords, which I then
want to use to prohibit identifiers from being named as such. However,
if I can prohibit this at the rule level, that would be better,
assuming I can identify the "keywords".