Whitespace problem with grammar

737 views
Skip to first unread message

Peter Petrov

unread,
Dec 25, 2019, 2:27:38 PM12/25/19
to antlr-di...@googlegroups.com
I took this grammar from the web and modified it slightly.
Attached is the version I have now. 
It's supposed to be a simple grammar for arithmetic expressions and equalities/inequalities. 

I am using the JavaScript ANTLR v4 runtime. 

antlr-javascript-runtime-4.7.2.zip 


Now... I am running into something curious.

(1+2)^3 is NOT recognized as an expression
but
(1 + 2)^3 is recognized as an expression

In the grammar I do have this lexer rule:

WS
   : [ \r\n\t] + -> skip
   ;

So why does it behave differently on the 2 inputs? 
I have no idea currently why... 
The only difference between them are the whitespaces.


Here is some additional info about the error raised when working on this input

(1+2)^3



C:\MathSystem>node main.js
line 1:2 extraneous input '+2' expecting ')'

====> Found number:  1
CONSTRUCTOR:  NumberContext
VALUE:  1

====> Found atom:  1
CONSTRUCTOR:  AtomContext
VALUE:  1

====> Found expression:
Found child [ 0 ] -> CONSTRUCTOR :  AtomContext TEXT:  1
VALUE:  1

====> Found expression:
Found child [ 0 ] -> CONSTRUCTOR :  TerminalNodeImpl TEXT:  (
Found child [ 1 ] -> CONSTRUCTOR :  ExpressionContext TEXT:  1
Found child [ 2 ] -> CONSTRUCTOR :  ErrorNodeImpl TEXT:  +2
Found child [ 3 ] -> CONSTRUCTOR :  TerminalNodeImpl TEXT:  )
C:\MathSystem\arithmeticListenerImpl.js:80
                value = mp_operators.get(op).exec(this.t.get(ctx.getChild(0)), this.t.get(ctx.getChild(2)));
                                            ^

TypeError: Cannot read property 'exec' of undefined
    at arithmeticListenerImpl.exitExpression (C:\MathSystem\arithmeticListenerImpl.js:80:31)
    at ExpressionContext.exitRule (C:\MathSystem\arithmeticParser.js:326:18)
    at ParseTreeWalker.exitRule (C:\MathSystem\node_modules\antlr4\tree\Tree.js:216:6)
    at ParseTreeWalker.walk (C:\MathSystem\node_modules\antlr4\tree\Tree.js:199:8)
    at ParseTreeWalker.walk (C:\MathSystem\node_modules\antlr4\tree\Tree.js:197:9)
    at ParseTreeWalker.walk (C:\MathSystem\node_modules\antlr4\tree\Tree.js:197:9)
    at Object.<anonymous> (C:\MathSystem\main.js:21:37)
    at Module._compile (internal/modules/cjs/loader.js:778:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:789:10)
    at Module.load (internal/modules/cjs/loader.js:653:32)

C:\MathSystem>




Any help would be much appreciated. 


arithmetic.g4
Message has been deleted

John B. Brodie

unread,
Dec 25, 2019, 2:52:36 PM12/25/19
to antlr-di...@googlegroups.com, Peter Petrov

the + in the +2 is being treated as the SIGN for the SCIENTIFIC_NUMBER.

be sure to dump the Token stream between your lexing and parsing phases in order to check that you are recognizing the tokens in the manner you desire.

i also think SCIENTIFIC_NUMBER has an issue with the exponent portion of the number. both the NUMBER and UNSIGNED_INTEGER fragments will recognize the same whole number input, ANTLR will disambiguate this collision by choosing the first matching rule, e.g. NUMBER.

i have not tried this but maybe changing NUMBER to be something like:

NUMBER: UNSIGNED_INTEGER ('.' UNSIGNED_INTEGER)? ;

(so that the digit range only appears in a single rule....)

Hope this helps

   -jbb

On 12/25/19 2:27 PM, Peter Petrov wrote:
I took this grammar from the web and modified it slightly.
Attached is the version I have now. 
It's supposed to be a simple grammar for arithmetic expressions and equalities/inequalities. 

I am using the JavaScript ANTLR v4 runtime. 

antlr-javascript-runtime-4.7.2.zip 


Now... I am running into something curious.

(1+2)^3 is NOT recognized as an expression
but
(1 + 2)^3 is recognized as an expression
In the grammar I do have this lexer rule:

WS
   : [ \r\n\t] + -> skip
   ;

So why does it behave differently on the 2 inputs? 
I have no idea currently why... 
The only difference between them are the whitespaces.

Any help would be much appreciated. 


--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/09ee31ba-4fbe-4fc3-aea2-d1bab3da13d7%40googlegroups.com.

Peter Petrov

unread,
Dec 25, 2019, 3:49:49 PM12/25/19
to antlr-discussion
Thanks a lot, I will try these suggestions.

OK... So you think the problem is in the grammar? 
But that does not explain why the spaces in the input fix the issue, or does it...? 


Here are the tokens in the 2 cases (one token on a line).

a) No spaces in the input expression:

(
1
+2
)
^
3
<EOF>


b) Spaces present in the input expression:

(
1
+
2
)
^
3
<EOF>


So yes, it seems like it's trying to tokenize +2 as a number. But why does it behave so only when spaces are missing in the input string?
Isn't that some bug there? I would at least expect a consistent behavior (a behavior where spaces don't matter). 
Maybe I am wrong, I only started playing with ANTLR 3-4 days ago. 



To unsubscribe from this group and stop receiving emails from it, send an email to antlr-di...@googlegroups.com.

Mike Lischke

unread,
Dec 26, 2019, 6:31:22 AM12/26/19
to antlr-discussion

OK... So you think the problem is in the grammar? 

Yes, the grammar has an issue here. Usually unary operators are not specified in lexer rules.

But that does not explain why the spaces in the input fix the issue, or does it...? 

It does. Look at the SCINTIFIC_NUMBER definition. It includes a sign. This conflicts with the PLUS/MINUS rules. ANTLR4 chooses the rule which matches the longest input sequence. Hence a `+2` matches the SCINTIFIC_NUMBER rule (which does not include optional whitespaces between + and 2, hence this rule is only used if there is no whitespace). But if you use a space between + and 2 then the plus is matched as individual token (because SCINTIFIC_NUMBER does not match in this case). Hence with a space you get two tokens (one for the plus and one for the digit).

However, when the plus is matched as unary operator for the SCINTIFIC_NUMBER rule you will have no operator in your expression, which leads to the syntax error. The correct way to specify the unary operator is to move it to the expression rule (note the removed `fragment` keyword for the `SIGN` rule):

expression:
LPAREN expression RPAREN
| SIGN expression
| atom
| expression (TIMES | DIV | PLUS | MINUS | POW) expression
;

SCIENTIFIC_NUMBER: NUMBER (E SIGN? UNSIGNED_INTEGER)?;

SIGN: ('+' | '-');


However, that won't work either, now because of another problem. The left recursive expression rule will always prefer the unary operator alt over the one with the binary operators. This can be seen better when you convert this rule to avoid the left recursion (which is the form that is used by ANTLR4 internally):

expression2:
(
LPAREN expression2 RPAREN
| SIGN expression2
| atom
)
((TIMES | DIV | PLUS | MINUS | POW) expression2)*
;

This form (as well as the original variant) leads to this parse tree:


As you can see the alt `LPAREN expression2 RPAREN` doesn't match the binary op part at the end of the expression2 rule, because it always expects one of the other variants first (an atom, the unary op or an expression in parentheses).

The solution for this problem is to split `expression` into 2 rules:

expression3:
simpleExpr (TIMES | DIV | PLUS | MINUS | POW) simpleExpr
;

simpleExpr:
LPAREN expression3 RPAREN
| SIGN simpleExpr
| atom
;

which finally parses fine (w/o whitespaces):



and here for the input `(1+ -   ++2)^3`:


But wait, we are still not done yet!

When you look closer at the last parse tree you will see that there's never a plus or minus token. All +/- are matched as sign and it looks as if only the unary op was matched (which is not really the case, but it looks so). This is the result of having conflicting lexer rules for + and - (namely `SIGN`, `PLUS` and `MINUS`). As it turns out the `SIGN` rule is completely useless. Just remove it and use the other two rules instead:

simpleExpr:
LPAREN expression3 RPAREN
| (PLUS | MINUS) simpleExpr
| atom
;

For the input `(1 +- 2) ^ 3` you will get now:


You differentiate between unary and binary operator by looking at the parent rule. Is it `simpleExpr` then it must be the unary op, otherwise it's the binary one.


Mike Lischke

unread,
Dec 26, 2019, 6:34:30 AM12/26/19
to antlr-discussion

When you look closer at the last parse tree you will see that there's never a plus or minus token. All +/- are matched as sign and it looks as if only the unary op was matched (which is not really the case, but it looks so)

I just realized it doesn't only look so, but indeed is that case. The binary op alt uses `PLUS` and `MINUS` and since these tokens are not created (due to the dominant `SIGN` rule) this alt can never match.

Mike Lischke

unread,
Dec 26, 2019, 6:48:00 AM12/26/19
to Mike Lischke, antlr-discussion

When you look closer at the last parse tree you will see that there's never a plus or minus token. All +/- are matched as sign and it looks as if only the unary op was matched (which is not really the case, but it looks so)

I just realized it doesn't only look so, but indeed is that case. The binary op alt uses `PLUS` and `MINUS` and since these tokens are not created (due to the dominant `SIGN` rule) this alt can never match.

And yet another follow-up: the conflict with the `SIGN` rule was introduced by making it a non-fragment rule. When removing it also the original expression variant works now:

expression:
LPAREN expression RPAREN
| unaryOp expression
| atom
| expression (TIMES | DIV | PLUS | MINUS | POW) expression
;

unaryOp: PLUS | MINUS;


With the definition of the `unaryOp` rule you can make it more obvious to detect when the unary alt was matched:




This is for the input: `(1 +- 2) ^ 3`


Mike Lischke

unread,
Dec 31, 2019, 5:16:35 AM12/31/19
to Peter Petrov, antlr-di...@googlegroups.com
Hi Peter,

I was able to install your VS Code plugin and play with the grammar and the trees produced on various inputs.
You should have pointed me to this plugin, I just found it by chance :) 

From time to time I post an announcement here for new versions :-)


I still don't get this part though which you wrote here... 

"It does. Look at the SCINTIFIC_NUMBER definition. It includes a sign. This conflicts with the PLUS/MINUS rules. ANTLR4 chooses the rule which matches the longest input sequence. Hence a `+2` matches the SCINTIFIC_NUMBER rule (which does not include optional whitespaces between + and 2, hence this rule is only used if there is no whitespace). But if you use a space between + and 2 then the plus is matched as individual token (because SCINTIFIC_NUMBER does not match in this case). Hence with a space you get two tokens (one for the plus and one for the digit)."

I don't get the whitespaces part.
Where does the grammar say that in  SCIENTIFIC_NUMBER there are no spaces?

At this point you have to understand how parser and lexer work in ANTLR4. The lexer tokenizes the input char by char (regardless whether it is a whitespace or something else, it just follows the lexer rules). For many languages whitespaces are separators between words and their combination and number doesn't matter at all for parsing the input. That's why they are usually skipped. This is done by adding an additional rule only for whitespaces and either put the token on a non-default channel (usually the one titled HIDDEN) or entirely skip it (by using the lexer action `-> skip`).

The parser only sees tokens that are on the default channel and hence never has to deal with whitespaces in such a scenario.

In the lexer however, if you want to make whitespaces part of any token, you have to explicitly specify them. Your scientific number rules does not specify any whitespace:

SCIENTIFIC_NUMBER: SIGN? NUMBER (E (PLUS | MINUS)? UNSIGNED_INTEGER)?;

Hence it will only match when a SIGN is directly followed by a NUMBER (if there's a +/- at all). As soon as you add a space after the + it cannot match this rule anymore. Instead it will match the PLUS rule then. Look at the token list in the debugger in vscode to see what tokens have been match. Checking the recognized tokens should always be your first step when debugging parser misbehavior.

My argumentation is now that you don't need a leading sign at all here, if you have also PLUS/MINUS and you add a unary op alt to your expression rule. It's usually necessary anyway, since any expression, not only numbers should allow for a single leading plus or minus. With that unary op it is no longer necessary to keep the leading `SIGN?` Part in `SCIENTIFIC_NUMBER` and your problem is solved (and the grammar is even improved by accepting the pretty common unary ops and it does no longer matter how many whitespaces are between a unary +/- and a number).

Where does it deal with spaces really? 
Or... are spaces assumed as some default separators regardless of what the grammar says about them.

My idea of this is wrong it seems... I thought before lexing/parsing all spaces are removed due to this rule 

WS
   : [ \r\n\t] + -> skip
   ;

and only then ANTLR does lexing i.e. tokenizing and finally parsing. 

But apparently that's not how it works. 

Could you elaborate on this a bit? The rest I think I understand in full now. 

Once again thanks a lot.

Regards, 
Peter



 




On Mon, Dec 30, 2019 at 10:11 AM Peter Petrov <p.a.p...@gmail.com> wrote:
I can't say how much I thank you for this detailed investigation which you did for all the advices you provided.

Apparently you did many edits and tried many things.
I think I follow most of your notes, I will reread them 2-3 more times. 

Still... in the last version which you have of the grammar, what happens with the SIGN? 
You said it's not needed. But what do we put in the SCIENTIFIC_NUMBER definition?
Just (PLUS | MINUS)? 

Could you send me the last version which you have of the g4 file, just want to visually compare it with mine to better understand the final version?

Many thanks once again!

Peter







<parsetree5.png>



This is for the input: `(1 +- 2) ^ 3`


--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/84AEADD6-1F5A-4ADE-8030-E4CD76540955%40googlemail.com.


Peter Petrov

unread,
Dec 31, 2019, 6:49:22 PM12/31/19
to antlr-di...@googlegroups.com, Mike Lischke
Never mind, I think I get it now.
I was able to install your VS Code plugin and play with the grammar and the trees produced on various inputs.
You should have pointed me to this plugin, I just found it by chance :) 

I still don't get this part though which you wrote here... 

"It does. Look at the SCINTIFIC_NUMBER definition. It includes a sign. This conflicts with the PLUS/MINUS rules. ANTLR4 chooses the rule which matches the longest input sequence. Hence a `+2` matches the SCINTIFIC_NUMBER rule (which does not include optional whitespaces between + and 2, hence this rule is only used if there is no whitespace). But if you use a space between + and 2 then the plus is matched as individual token (because SCINTIFIC_NUMBER does not match in this case). Hence with a space you get two tokens (one for the plus and one for the digit)."

I don't get the whitespaces part.
Where does the grammar say that in  SCIENTIFIC_NUMBER there are no spaces?
Where does it deal with spaces really? 
Or... are spaces assumed as some default separators regardless of what the grammar says about them.

My idea of this is wrong it seems... I thought before lexing/parsing all spaces are removed due to this rule 

WS
   : [ \r\n\t] + -> skip
   ;

and only then ANTLR does lexing i.e. tokenizing and finally parsing. 

But apparently that's not how it works. 

Could you elaborate on this a bit? The rest I think I understand in full now. 

Once again thanks a lot.

Regards, 
Peter



 
On Mon, Dec 30, 2019 at 10:11 AM Peter Petrov <p.a.p...@gmail.com> wrote:
I can't say how much I thank you for this detailed investigation which you did for all the advices you provided.

Apparently you did many edits and tried many things.
I think I follow most of your notes, I will reread them 2-3 more times. 

Still... in the last version which you have of the grammar, what happens with the SIGN? 
You said it's not needed. But what do we put in the SCIENTIFIC_NUMBER definition?
Just (PLUS | MINUS)? 

Could you send me the last version which you have of the g4 file, just want to visually compare it with mine to better understand the final version?

Many thanks once again!

Peter







On Thu, Dec 26, 2019 at 6:48 AM 'Mike Lischke' via antlr-discussion <antlr-di...@googlegroups.com> wrote:
--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/84AEADD6-1F5A-4ADE-8030-E4CD76540955%40googlemail.com.

Peter Petrov

unread,
Dec 31, 2019, 6:49:22 PM12/31/19
to antlr-di...@googlegroups.com, Mike Lischke
I can't say how much I thank you for this detailed investigation which you did for all the advices you provided.

Apparently you did many edits and tried many things.
I think I follow most of your notes, I will reread them 2-3 more times. 

Still... in the last version which you have of the grammar, what happens with the SIGN? 
You said it's not needed. But what do we put in the SCIENTIFIC_NUMBER definition?
Just (PLUS | MINUS)? 

Could you send me the last version which you have of the g4 file, just want to visually compare it with mine to better understand the final version?

Many thanks once again!

Peter







On Thu, Dec 26, 2019 at 6:48 AM 'Mike Lischke' via antlr-discussion <antlr-di...@googlegroups.com> wrote:
--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/84AEADD6-1F5A-4ADE-8030-E4CD76540955%40googlemail.com.
Reply all
Reply to author
Forward
0 new messages