Keywords as identifiers - again - but different problem


Mike Lischke

Feb 7, 2013, 10:46:27 AM
to antlr-di...@googlegroups.com
Hi,

The problem of allowing non-reserved keywords as identifiers in ANTLR (3.4) has been discussed a lot already, and the usual solution looks something like this:

identifier:
    IDENTIFIER
    | keyword
;

keyword:
    kw1
    | kw2
    | kw3
    ...
;

This works fine but has a big drawback: with a large number of keywords, the generated parser becomes huge. I'm currently looking at a size of ~24MB (C target).

In order to get this size down (because it causes a lot of trouble in various tools, including Xcode), I tried to write an is_keyword() function and use it like so:

identifier:
{is_keyword(LA(1))}? => .
| IDENTIFIER
;

This brought the parser size down to ~7MB (great!) but gives me a large number of warnings ("Input such as ... is insufficiently covered with predicates"), probably because of the dot and the hoisting that is done in the parser. Additionally, I get an error that certain alts can never be matched. So this construct doesn't work at all.

My thought was that I could avoid this by replacing the dot with the actual token. But how can I use the current lexer token instead? Or is there another approach I could try?

Jim Idle

Feb 7, 2013, 12:54:13 PM
to antlr-di...@googlegroups.com
As also said before: order your lexer tokens such that all the reserved words are contiguous and all the keywords are contiguous. This will reduce the switches and if statements, and despite the large amount of code that is seemingly generated, the C compiler will optimize it just fine.

There are trade-offs in any solution, but larger code does not always mean slower speed.

Jim

Mike Lischke

Feb 8, 2013, 3:11:57 AM
to antlr-di...@googlegroups.com
Hey Jim,

As also said before: order your lexer tokens such that all the reserved words are contiguous and all the keywords are contiguous.

By that you mean defining dedicated lexer tokens for each keyword instead of using literals in parser rules? Doesn't ANTLR generate ids for all literals anyway (anonymous ids if not explicitly declared)?

This will reduce the switches and if statements, and despite the large amount of code that is seemingly generated, the C compiler will optimize it just fine.

Might well be, but this request is not about code efficiency, it's about source code size. Handling such large source files (the lexer is almost 40MB, the parser currently ~24MB) is difficult, and only a few text tools can cope with files that large. For instance, we have a commit hook that checks for mixed line endings, which takes a small eternity on those files. Diff viewers crash, IDEs cannot display the files, etc. So I want to bring down the sheer line count in these files, and I think it should be possible if I can only find the right screws to turn. It all boils down to two facts: many keywords that can be identifiers (parser size), and identifiers that can consist of almost any Unicode character (lexer size).

Thanks,

Jim Idle

Feb 8, 2013, 3:18:51 AM
to antlr-di...@googlegroups.com
When you make them contiguous, it reduces the generated code drastically. 

Because the huge ifs and ands collapse into just a few, and so on. It may not drastically reduce the total line count, though. But sed can handle millions of lines instantly ;).

Also, you should really not keep the generated code in an SCM. You may have a good reason. 

Grammar complexity may just result in large files, I am afraid.

Jim

Mike Lischke

Feb 8, 2013, 4:48:28 AM
to antlr-di...@googlegroups.com
When you make them contiguous, it reduces the generated code drastically. 

The only way to influence this is how they are listed in the grammar, correct?

Because the huge ifs and ands collapse into just  a few and so on. It may not be drastically reducing the total line count though.

From what I saw, this is already the case; ranges are used in many places. I assume the reason is the large lists of tokens in the switch statements used for lookahead (because identifiers can appear in so many places, the list of keywords gets duplicated over and over). Maybe this could be improved a lot by moving the switch statement (or this specific part) into a separate function. Just an idea.


But sed can handle millions of lines instantly ;). 

:-D Maybe we should just change our development environment to be based mainly on sed ...

Grammar complexity may just result in large files I am afraid. 

Absolutely, but when you see that removing a single feature cuts the size by a factor of more than 3, you start thinking...

Mike
--
Mike Lischke, Principal Software Engineer
MySQL Developer Tools
Oracle Corporation, www.oracle.com

Jim Idle

Feb 9, 2013, 2:28:09 AM
to antlr-di...@googlegroups.com
Yeah. The problem with generating code is that it has to cover all possibilities. Clearly you can do better with hand-crafted code, but then you have to maintain it ;)

I was going to write the MySQL grammar at one point in time but life changes mean spare time is short these days. 

The construction of the grammar influences things a lot. For a start, though, don't let ANTLR generate the token numbers. Pre-declare them in a .tokens file and import it into the lexer. Just start with the generated version and rearrange them.
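For illustration, a pre-declared .tokens file is just a list of name=number assignments; rearranging the numbers so the keywords form one contiguous block is what Jim is suggesting. A tiny sketch with hypothetical token names (an actual file would list every token of the grammar):

```
IDENTIFIER=4
KW_SELECT=5
KW_FROM=6
KW_WHERE=7
```

The lexer grammar can then pick these numbers up via the tokenVocab option instead of letting ANTLR assign its own.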

Are you working on a MySQL grammar? If so, I'll send you my Tsql grammar.

Jim

Mike Lischke

Feb 9, 2013, 3:57:13 AM
to antlr-di...@googlegroups.com
The construction of the grammar influences things a lot. For a start, though, don't let ANTLR generate the token numbers. Pre-declare them in a .tokens file and import it into the lexer. Just start with the generated version and rearrange them.

Ah, I didn't know that was possible. Thank you, I'll investigate in that direction.


Are you working on a MySQL grammar? If so, I'll send you my Tsql grammar.

Yep, I'm doing the final touches on it (mostly optimizations). The grammar is complete and covers all versions from 4.1 to 5.6 (which was recently released as GA). Version-specific features are switchable at runtime. As soon as I have approval, I plan to publish it in the ANTLR grammar dir.

Thanks,

Mike Lischke

Feb 25, 2013, 5:09:30 AM
to antlr-di...@googlegroups.com
Hey Jim,

 

The construction of the grammar influences things a lot. For a start, though, don't let ANTLR generate the token numbers. Pre-declare them in a .tokens file and import it into the lexer. Just start with the generated version and rearrange them.

This is actually a great tip! I tried it and the generated parser size went down to 9MB. The influence of the .tokens file really deserves much more prominent documentation.

Btw., I'm not sure this page http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741 tells the entire truth. I had a variation of the first solution, and what happened was that the parser first looked for ID before the predicate was executed, so it never got to the predicates. Hence my keyword rule never matched.

My rule looked similar to this:

keyword:
    {LA(1) >=keyword1 && LA(1) <= keyword300}? ID
;

The second solution is what works well, especially if you specify your token ids in an order that best fits that rule.
 

Are you working on a MySQL grammar? If so, I'll send you my Tsql grammar.

Yes, please do. I certainly can learn something from it.
 

Thanks,

Mike

Mike Lischke

Feb 25, 2013, 5:11:02 AM
to antlr-di...@googlegroups.com

keyword:
    {LA(1) >=keyword1 && LA(1) <= keyword300}? ID
;

To avoid confusion, that should rather be KEYWORD1 and KEYWORD300, since those are lexer tokens.

Mike