Blogging about MySQL Workbench's parser infrastructure


Mike Lischke

Aug 17, 2015, 2:58:20 AM
to antlr-di...@googlegroups.com
For those interested, I wrote a blog post that gives a rough overview of the parsing infrastructure in MySQL Workbench, which makes heavy use of ANTLR-generated lexers and parsers.


Another one will follow explaining the details behind our code completion implementation.

Thanks again to Ter, Sam and Jim! Without you we wouldn't have this powerful toolset. Much appreciated!

Jim Idle

Aug 17, 2015, 9:10:30 PM
to antlr-di...@googlegroups.com
I read it through - I think you can probably reduce the size of the lexer by not specifying the code points directly but using code along the lines of:

ID : { isIdStart() }? { consumeID(); } ;

Build the knowledge of which characters are valid into those functions (hopefully by using built-in OS functions, but if you have to, by constructing a table of the characters yourself).
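
For example, a rough, untested sketch of such a helper in C, using the C library for the ASCII part (the ranges above 0x7F are only illustrative, not the exact MySQL identifier set):

#include <ctype.h>

/* Sketch only: the real character set has to come from the MySQL grammar.
   ASCII goes through the C library; everything from 0x80 up to (but not
   including) 0xFFFF is treated as a potential identifier character. */
static int isIdChar(unsigned int c)
{
  if (c >= 0x80)
    return c < 0xFFFF;
  return isalnum((int)c) || c == '$' || c == '_';
}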

I would be happy to take a look at this in a little more detail if you want to let me see the lexer grammar.

Cheers for the nod in the article.

Jim



Mike Lischke

Aug 18, 2015, 3:18:07 AM
to antlr-di...@googlegroups.com
Hey Jim,

I read it through - I think you can probably reduce the size of the lexer by not specifying the code points directly but using code along the lines of:

ID : { isIdStart() }? { consumeID(); } ;

I have thought about reducing lexer and parser sizes, but always from the parser viewpoint (avoiding having to define hundreds of keywords in the grammar), and I haven't found a good solution yet. It looks like it would be worth checking whether the lexer side can be optimized too. Thanks for the hint.


Build the knowledge of which characters are valid into those functions (hopefully by using built-in OS functions, but if you have to, by constructing a table of the characters yourself).

I would be happy to take a look at this in a little more detail if you want to let me see the lexer grammar.

The grammar is part of the MySQL Workbench zip package (http://dev.mysql.com/downloads/workbench/, pick the "source code" entry) or can be downloaded here: http://wb.fabforce.eu/wp-content/uploads/MySQL.g.zip.


Jim Idle

Aug 18, 2015, 4:13:25 AM
to antlr-di...@googlegroups.com
Just took a quick look. I think the lexer can be reduced (assuming that it causes any problems as it is). I'll offer some suggestions tomorrow.

jim

Mike Lischke

Aug 21, 2015, 2:38:37 AM
to antlr-discussion
Hey Jim,


On Tuesday, August 18, 2015 at 10:13:25 AM UTC+2, Jim Idle wrote:
Just took a quick look. I think the lexer can be reduced (assuming that it causes any problems as it is). I'll offer some suggestions tomorrow.

Did any further ideas come to mind for optimizing the lexer? I haven't gotten around to testing your ID variant yet, but I'm curious what else can be done.

Jim Idle

Aug 21, 2015, 4:02:24 AM
to antlr-di...@googlegroups.com
I was busy with work, so I didn't get time to look at it yet. I should have a little time over the weekend, but I suspect that if you use code to consume large sets of character classes, you will find the generated code shrinks massively.

Jim

Mike Lischke

Aug 28, 2015, 9:35:18 AM
to antlr-di...@googlegroups.com
Hey Jim,

I read it through - I think you can probably reduce the size of the lexer by not specifying the code points directly but using code along the lines of:

ID : { isIdStart() }? { consumeID(); } ;

Can you explain a bit more about that solution? My approach leads to no-viable-alt errors:

IDENTIFIER: {isIDStart(ctx) } ? { consumeID(ctx); };

with:

  ANTLR3_BOOLEAN isIDStart(pMySQLLexer ctx)
  {
    int input = LA(1);
    // Only upper case, as we use a case insensitive parser (insensitive only for ASCII).
    return (input >= '0' && input <= '9') || (input >= 'A' && input <= 'Z') ||
      (input == '$') || (input == '_') || (input >= 0x80 && input < 0xFFFF);
  }

  void consumeID(pMySQLLexer ctx)
  {
    int input = LA(1);
    while ((input >= '0' && input <= '9') || (input >= 'A' && input <= 'Z') || (input == '$') ||
      (input == '_') || (input >= 0x80 && input < 0xFFFF))
    {
      CONSUME();
      //LEXSTATE->type = IDENTIFIER;

      input = LA(1);
    }
  }
  
Since the IDENTIFIER rule now contains only a predicate and an action, we probably need some value to "match". But since I consume everything in the action, I'm not sure what to put there.


Mike Lischke

Aug 28, 2015, 10:22:49 AM
to antlr-di...@googlegroups.com
Jim,

I read it through - I think you can probably reduce the size of the lexer by not specifying the code points directly but using code along the lines of:

ID : { isIdStart() }? { consumeID(); } ;

Can you explain a bit more about that solution? My approach leads to no-viable-alt errors:

IDENTIFIER: {isIDStart(ctx) } ? { consumeID(ctx); };

I've gotten a bit further with this, but can't quite get it to work. I overcame the no-viable-alt errors by using:

IDENTIFIER: { isIDStart(ctx) }? => { consumeID(ctx); EMIT(); };

However, I still encounter two problems with that. First, the mIDENTIFIER() function in my lexer is never triggered; the lexer behaves as if the IDENTIFIER rule weren't there at all.

The other problem is that I have a rule:

UNDERSCORE_CHARSET: UNDERLINE_SYMBOL IDENTIFIER { $type = check_charset(PAYLOAD, $text); };

It uses the IDENTIFIER rule and with that variant from above I get the error:

error(208): MySQL.g:4719:1: The following token definitions can never be matched because prior tokens match the same input: UNDERSCORE_CHARSET


UNDERSCORE_CHARSET is defined before the IDENTIFIER rule.

Any help solving this would be appreciated.


Jim Idle

Aug 30, 2015, 9:05:10 PM
to antlr-di...@googlegroups.com
Yeah - I have to look up how I have done this before, but I think that you want a . and then have the predicate reject it if it is not in range. Because the ranges are so large, using a normal rule results in large tables, so taking them into your own code like this should reduce the size of the tables a lot (IIRC).
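
In other words, something in this direction (from memory, untested):

IDENTIFIER : { isIdStart() }? . { consumeID(); } ;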

Jim

Jim Idle

Aug 30, 2015, 9:10:19 PM
to antlr-di...@googlegroups.com
On Fri, Aug 28, 2015 at 10:22 PM, 'Mike Lischke' via antlr-discussion <antlr-di...@googlegroups.com> wrote:
Jim,

I read it through - I think you can probably reduce the size of the lexer by not specifying the code points directly but using code along the lines of:

ID : { isIdStart() }? { consumeID(); } ;

Can you explain a bit more about that solution? My approach leads to no-viable-alt errors:

IDENTIFIER: {isIDStart(ctx) } ? { consumeID(ctx); };

I've gotten a bit further with this, but can't quite get it to work. I overcame the no-viable-alt errors by using:

IDENTIFIER: { isIDStart(ctx) }? => { consumeID(ctx); EMIT(); };

However, I still encounter two problems with that. First, the mIDENTIFIER() function in my lexer is never triggered; the lexer behaves as if the IDENTIFIER rule weren't there at all.

try   { isIDStart(ctx) }? . { consumeID(ctx); EMIT(); };

I think that we might then run into not enough predicate coverage or something. It's been a while since I did this. I don't think that you need the EMIT() call.

The other problem is that I have a rule:

UNDERSCORE_CHARSET: UNDERLINE_SYMBOL IDENTIFIER { $type = check_charset(PAYLOAD, $text); };

It uses the IDENTIFIER rule and with that variant from above I get the error:

error(208): MySQL.g:4719:1: The following token definitions can never be matched because prior tokens match the same input: UNDERSCORE_CHARSET

Instead of calling the rule, just repeat the predicate and call the consumeID function, I think.
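
Something roughly like this, perhaps (untested; I'm not sure off-hand whether the gated form is accepted mid-rule, so this uses a plain predicate):

UNDERSCORE_CHARSET: UNDERLINE_SYMBOL { isIDStart(ctx) }? { consumeID(ctx); $type = check_charset(PAYLOAD, $text); };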


UNDERSCORE_CHARSET is defined before the IDENTIFIER rule.

Any help solving this would be appreciated.

Mike

Let me know if you get further this way,

Jim

Mike Lischke

Aug 31, 2015, 6:26:06 AM
to antlr-di...@googlegroups.com
Hey Jim,

thanks for your feedback. See below:


IDENTIFIER: { isIDStart(ctx) }? => { consumeID(ctx); EMIT(); };

However, I still encounter two problems with that. First, the mIDENTIFIER() function in my lexer is never triggered; the lexer behaves as if the IDENTIFIER rule weren't there at all.

try   { isIDStart(ctx) }? . { consumeID(ctx); EMIT(); };

I think that we might then run into not enough predicate coverage or something. It's been a while since I did this. I don't think that you need the EMIT() call.

Yes, right, the EMIT() call isn't necessary. But I get weird results with such a rule:

1) If called with a single letter, it works as expected. No syntax error.

2) If called with 2 letters (or more), the outcome depends on the input. For letter combinations that start a keyword I get a syntax error and the first letter is cut off, apparently even before mIDENTIFIER() is called. Many other letter combinations work fine:

For "select s" I get:

(line: 1, offset: 0, nil)    nil
(line: 1, offset: 0, length: 6, index: 0, SELECT_SYMBOL [686])    select
(line: 1, offset: 0, length: 1, index: 0, SELECT_EXPR_TOKEN [685])    SELECT_EXPR_TOKEN
(line: 1, offset: 0, length: 1, index: 0, EXPRESSION_TOKEN [496])    EXPRESSION_TOKEN
(line: 1, offset: 0, length: 1, index: 0, COLUMN_REF_TOKEN [425])    COLUMN_REF_TOKEN
(line: 1, offset: 7, length: 1, index: 2, IDENTIFIER [528])    s
(line: 1, offset: 0, length: 1, index: 0, ANTLR3_TOKEN_EOF [-1])    <EOF>


For "select se" I get:

(line: 1, offset: 0, nil)    nil
(line: 1, offset: 0, length: 6, index: 0, SELECT_SYMBOL [686])    select
(line: 1, offset: 0, length: 1, index: 0, SELECT_EXPR_TOKEN [685])    SELECT_EXPR_TOKEN
(line: 1, offset: 0, length: 1, index: 0, EXPRESSION_TOKEN [496])    EXPRESSION_TOKEN
(line: 1, offset: 0, length: 1, index: 0, COLUMN_REF_TOKEN [425])    COLUMN_REF_TOKEN
(line: 1, offset: 8, length: 1, index: 2, IDENTIFIER [528])    e
(line: 1, offset: 0, length: 1, index: 0, ANTLR3_TOKEN_EOF [-1])    <EOF>

The error here is rather strange:

1 errors found
sql-script(1)  : lexer error 3 : 1:1: Tokens : ( EQUAL_OPERATOR | ASSIGN_OPERATOR |  ... <all my keywords here> ... DOUBLE_QUOTED_TEXT | SINGLE_QUOTED_TEXT | COMMENT_RULE );, at offset 10(end of input).
This indicates a poorly specified lexer RULE
or unterminated input element such as: "STRING["]
 unexpected input

This doesn't tell me anything with respect to my identifier rule. Error 3 comes before the lexer goes into recovery mode, right?

3) Some letter combinations return a different error, though I cannot see why they are special:

For "select ss" I get:

(line: 1, offset: 0, nil)    nil
(line: 1, offset: 0, length: 6, index: 0, SELECT_SYMBOL [686])    select
(line: 0, offset: 0, length: 1, index: 0, <invalid> [0])    Tree Error Node
(line: 1, offset: 0, length: 1, index: 0, ANTLR3_TOKEN_EOF [-1])    <EOF>

With error:

2 errors found
sql-script(1)  : lexer error 1 : Unexpected character, at offset 10(end of input).
This indicates a poorly specified lexer RULE
or unterminated input element such as: "STRING["]
 : syntax error...
-end of input-(1)  : parser error 3 : , at offset 0, at <EOF> unexpected input

4) Numbers are no longer automatically lexed correctly.

Input like "select 1" gives me now:

(line: 1, offset: 0, nil)    nil
(line: 1, offset: 0, length: 6, index: 0, SELECT_SYMBOL [686])    select
(line: 1, offset: 0, length: 1, index: 0, SELECT_EXPR_TOKEN [685])    SELECT_EXPR_TOKEN
(line: 1, offset: 0, length: 1, index: 0, EXPRESSION_TOKEN [496])    EXPRESSION_TOKEN
(line: 1, offset: 0, length: 1, index: 0, COLUMN_REF_TOKEN [425])    COLUMN_REF_TOKEN
(line: 1, offset: 7, length: 1, index: 2, IDENTIFIER [528])    1
(line: 1, offset: 0, length: 1, index: 0, ANTLR3_TOKEN_EOF [-1])    <EOF>

while it originally was:

(line: 1, offset: 0, nil)    nil
(line: 1, offset: 0, length: 6, index: 0, SELECT_SYMBOL [687])    select
(line: 1, offset: 0, length: 1, index: 0, SELECT_EXPR_TOKEN [686])    SELECT_EXPR_TOKEN
(line: 1, offset: 0, length: 1, index: 0, EXPRESSION_TOKEN [496])    EXPRESSION_TOKEN
(line: 1, offset: 7, length: 1, index: 2, INT_NUMBER [549])    1
(line: 1, offset: 0, length: 1, index: 0, ANTLR3_TOKEN_EOF [-1])    <EOF>

I tried different predicate types (e.g. gated semantic predicate) but the outcome was always the same, even though the generated code was significantly different.

For completeness, here's what I used last for testing:

IDENTIFIER:
{ isIDStart(ctx) }? => . { consumeID(ctx); }
|
;

and the C code:

  ANTLR3_BOOLEAN isIDStart(pMySQLLexer ctx)
  {
    ANTLR3_UINT32 input = LA(1);
    
     // Only upper case, as we use a case insensitive parser (insensitive only for ASCII).
    return (input >= '0' && input <= '9') || (input >= 'A' && input <= 'Z') ||
 (input == '$') || (input == '_') || (input >= 0x80 && input < 0xFFFF);
  }
  
  void consumeID(pMySQLLexer ctx)
  {
    ANTLR3_UINT32 input = LA(1);
    while ((input >= '0' && input <= '9') || (input >= 'A'  && input <= 'Z') || (input == '$') ||
      (input == '_') || (input >= 0x80 && input < 0xFFFF))
    {
      CONSUME();
      input = LA(1);
    }
  }

For the number recognition I could of course check in the consumeID function whether the identifier consists purely of digits. But that wasn't necessary before, so I hesitate. In any case, the other problems must be solved first.
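
If I did go that route, it would probably look roughly like this (untested sketch; note that the first character, already matched by the '.', would need the same digit check, and I'm reusing LEXSTATE->type as in my earlier attempt):

  void consumeID(pMySQLLexer ctx)
  {
    ANTLR3_BOOLEAN onlyDigits = ANTLR3_TRUE;
    ANTLR3_UINT32 input = LA(1);
    while ((input >= '0' && input <= '9') || (input >= 'A' && input <= 'Z') || (input == '$') ||
      (input == '_') || (input >= 0x80 && input < 0xFFFF))
    {
      if (input < '0' || input > '9')
        onlyDigits = ANTLR3_FALSE;
      CONSUME();
      input = LA(1);
    }
    if (onlyDigits)
      LEXSTATE->type = INT_NUMBER; // Switch the token type when only digits were seen.
  }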

Again, I would be thankful for any help with this. By the way, the lexer .c file size went down to ~8MB (from ~40MB), so the idea itself seems right.

Mike

Jim Idle

Aug 31, 2015, 9:18:09 PM
to antlr-di...@googlegroups.com
Yeah - I seem to remember that it gets tricky.

The numbers look like they are not parsing because the . allows an ID to start with anything, but your predicate should reject the rule if it sees that the first character (the character last consumed, and therefore the first character of the token) is not valid for the start of an identifier. However, the code you show is allowing '0'-'9' as valid ID start characters.

The other errors are caused, I suspect, because the . is fine for select s but isn't predicting select sg and similar. However, I thought this would not matter, as having predicted one character it should call the action and let you consume the rest. Perhaps it will change if you change the ID start character function so that it rejects numerics correctly.
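
That is, something like your isIDStart without the '0'..'9' range in the start set (sketch only):

  ANTLR3_BOOLEAN isIDStart(pMySQLLexer ctx)
  {
    ANTLR3_UINT32 input = LA(1);
    // Digits must not start an identifier; they can still appear in the rest of it via consumeID().
    return (input >= 'A' && input <= 'Z') || (input == '$') || (input == '_') ||
      (input >= 0x80 && input < 0xFFFF);
  }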

I'll have to look at the generated code to see exactly why this happens, but the predicate code is not always as simple as it appears.


Mike Lischke

Sep 2, 2015, 5:16:01 AM
to antlr-di...@googlegroups.com
Yeah - I seem to remember that it gets tricky.

The numbers look like they are not parsing because the . allows an ID to start with anything, but your predicate should reject the rule if it sees that the first character (the character last consumed, and therefore the first character of the token) is not valid for the start of an identifier. However, the code you show is allowing '0'-'9' as valid ID start characters.

Typical case: it starts so easy and you keep adding stuff, and adding and testing and adding :-D

In the meantime I have found a grammar which uses a similar approach: http://research.xebic.com/es3/. It first matches ASCII chars in the id rule and has an empty alt with an action to handle the rest. This works fine (I created a JS syntax checker from it), so I will look into what it takes to port it to my MySQL grammar.
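
As far as I can tell, the pattern boils down to something of this shape (my rough paraphrase adapted to my grammar, not their literal rule; consumeNonAsciiID is a made-up helper name):

IDENTIFIER:
    ('A'..'Z' | '$' | '_') ('A'..'Z' | '0'..'9' | '$' | '_')*
  | /* empty alt, action consumes a non-ASCII identifier */ { consumeNonAsciiID(ctx); }
  ;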

