Newest C++ runtime - problem with comments

110 views
Skip to first unread message

tor.jer...@att.net

unread,
Sep 24, 2016, 1:36:00 PM9/24/16
to antlr-discussion

I had an earlier version of the C++ runtime and antlr4 jar file that worked fairly well. I've tried to upgrade to the newest version of the C++ runtime, but now it seems to fail when trying to add a standard comment rule in the lexer.

Here is a simplified split parser/lexer use case.

The parser rules for a simple number expression grammar:

parser grammar Test;

options {
  language=Cpp;
  tokenVocab=TestL;
}

eval
    :    additionExp    ;

additionExp
    :    multiplyExp ( PLUS multiplyExp | MINUS multiplyExp )* ;

multiplyExp
    :    atomExp( STAR atomExp | DIV atomExp )* ;

atomExp
    :    NUMBER | LPAR additionExp RPAR
    ;

The lexer file is:

lexer grammar TestL;

options {
  language=Cpp;
}

NUMBER :    ('0'..'9')+ ('.' ('0'..'9')+) ;

Comment:   '//' .*? ('\r')? '\n' -> channel(HIDDEN) ;

WS  :   (' ' | '\t' | '\r'| '\n') -> channel(HIDDEN) ;

PLUS : '+' ;
MINUS : '-';
STAR : '*';
DIV : '/';
LPAR : '(';
RPAR : ')';
 
The problem is that while the white space rule works fine, the comment rule does not (I've tried every form of comment rules that I could find online).

Given the input file:
// this is a comment
   
5 + 3 * 5

The above grammar produces the following output:

line 1:0 mismatched input '/' expecting '{NUMBER, '('}'
line 1:3 token recognition error at: 't'
line 1:4 token recognition error at: 'h'
line 1:5 token recognition error at: 'i'
line 1:6 token recognition error at: 's'
line 1:8 token recognition error at: 'i'
line 1:9 token recognition error at: 's'
line 1:11 token recognition error at: 'a'
line 1:13 token recognition error at: 'c'
line 1:14 token recognition error at: 'o'
line 1:15 token recognition error at: 'm'
line 1:16 token recognition error at: 'm'
line 1:17 token recognition error at: 'e'
line 1:18 token recognition error at: 'n'
line 1:19 token recognition error at: 't'
line 1:1 extraneous input '/' expecting '{NUMBER, '('}'
Parse Tree: (eval (additionExp (multiplyExp atomExp / (atomExp / 5)) + (multiplyExp (atomExp 3) * (atomExp 5))))

If I change the Comment rule to:

Comment : '//' -> channel(HIDDEN) ;

At last the initial part gets recognized. While that's hardly all that useful it may give some pointer to where the issue lies.

line 1:3 token recognition error at: 't'
line 1:4 token recognition error at: 'h'
line 1:5 token recognition error at: 'i'
line 1:6 token recognition error at: 's'
line 1:8 token recognition error at: 'i'
line 1:9 token recognition error at: 's'
line 1:11 token recognition error at: 'a'
line 1:13 token recognition error at: 'c'
line 1:14 token recognition error at: 'o'
line 1:15 token recognition error at: 'm'
line 1:16 token recognition error at: 'm'
line 1:17 token recognition error at: 'e'
line 1:18 token recognition error at: 'n'
line 1:19 token recognition error at: 't'
Parse Tree: (eval (additionExp (multiplyExp (atomExp 5)) + (multiplyExp (atomExp 3) * (atomExp 5)))) 

For completeness, here is the C++ driver (VS2015):

#include <iostream>
#include <antlr4-runtime.h>
#include "TestL.h"
#include "Test.h"

using namespace antlr4;

int main(int argc, char **argv)
{
std::ifstream inFile("test.tst");
antlr4::ANTLRInputStream istrm(inFile);
TestL lexer(&istrm);
CommonTokenStream tokens(&lexer);
Test parser(&tokens);

std::shared_ptr<tree::ParseTree> tree = parser.eval();

std::string s = tree->toStringTree(&parser);

std::cout << "Parse Tree: " << s << std::endl;
return 0;
}

Best regards,

TJ

Mike Lischke

unread,
Sep 25, 2016, 6:55:33 AM9/25/16
to antlr-di...@googlegroups.com
Hi Tor,

I had an earlier version of the C++ runtime and antlr4 jar file that worked fairly well. I've tried to upgrade to the newest version of the C++ runtime, but now it seems to fail when trying to add a standard comment rule in the lexer.

...

The lexer file is:

lexer grammar TestL;

options {
  language=Cpp;
}

NUMBER :    ('0'..'9')+ ('.' ('0'..'9')+) ;

Comment:   '//' .*? ('\r')? '\n' -> channel(HIDDEN) ;

WS  :   (' ' | '\t' | '\r'| '\n') -> channel(HIDDEN) ;

PLUS : '+' ;
MINUS : '-';
STAR : '*';
DIV : '/';
LPAR : '(';
RPAR : ')';
 
The problem is that while the white space rule works fine, the comment rule does not (I've tried every form of comment rules that I could find online).

Hmm, I don't see a problem here. Have you tried to use the latest code from Github? Your NUMBER rule is missing a ? otherwise it will not match your numbers in the test input. In order to help finding the problem I uploaded new files to my homepage (new jar, new source zip and new prebuilt binaries).


Message has been deleted

tor.jer...@att.net

unread,
Sep 25, 2016, 11:50:29 AM9/25/16
to antlr-discussion
Mike,

Thank you. I downloaded both the source and binary distributions of the new runtime, as well as the jar file (from http://www.soft-gems.net/index.php/all-downloads). The results are exactly the same - i.e., it's not working like it is supposed to.

BTW (the question mark for the NUMBER rule was in my file, just didn't make it to the copy & paste).

Best regards,

TJ

Mike Lischke

unread,
Sep 25, 2016, 12:17:09 PM9/25/16
to antlr-di...@googlegroups.com
Thank you. I downloaded both the source and binary distributions of the new runtime, as well as the jar file (from http://www.soft-gems.net/index.php/all-downloads). The results are exactly the same - i.e., it's not working like it is supposed to.


How does the token list look like? Is the comment tokenized as such?


tor.jer...@att.net

unread,
Sep 25, 2016, 12:33:44 PM9/25/16
to antlr-discussion
The token list is:

NUMBER=1
Comment=2
WS=3
PLUS=4
MINUS=5
STAR=6
DIV=7
LPAR=8
RPAR=9
'+'=4
'-'=5
'*'=6
'/'=7
'('=8
')'=9

Tor

Mike Lischke

unread,
Sep 26, 2016, 3:30:17 AM9/26/16
to antlr-di...@googlegroups.com
You sent me the token values, but I was after the tokens found by your lexer :-)


tor.jer...@att.net

unread,
Sep 26, 2016, 10:51:44 AM9/26/16
to antlr-discussion
Yes, that sounds more reasonable. Here is the printout when printing the type and text of the tokens. Seems like the beginning of the comment is not recognized, but taken as two '/' divide tokens.

Getting tokens 24
Token 0:  type: 7 text: '/'
Token 1:  type: 7 text: '/'
Token 2:  type: 3 text: ' '
Token 3:  type: 3 text: ' '
Token 4:  type: 3 text: ' '
Token 5:  type: 3 text: ' '
Token 6:  type: 3 text: '
'
Token 7:  type: 3 text: ' '
Token 8:  type: 3 text: ' '
Token 9:  type: 3 text: ' '
Token 10:  type: 3 text: '
'
Token 11:  type: 1 text: '5'
Token 12:  type: 3 text: ' '
Token 13:  type: 4 text: '+'
Token 14:  type: 3 text: ' '
Token 15:  type: 1 text: '3'
Token 16:  type: 3 text: ' '
Token 17:  type: 6 text: '*'
Token 18:  type: 3 text: ' '
Token 19:  type: 1 text: '5'
Token 20:  type: 3 text: '
'
Token 21:  type: 3 text: '
'
Token 22:  type: 3 text: '
'
Token 23:  type: -1 text: '<EOF>'

Tor

Mike Lischke

unread,
Sep 26, 2016, 11:01:04 AM9/26/16
to antlr-di...@googlegroups.com
Yes, that sounds more reasonable. Here is the printout when printing the type and text of the tokens. Seems like the beginning of the comment is not recognized, but taken as two '/' divide tokens.

Getting tokens 24
Token 0:  type: 7 text: '/'
Token 1:  type: 7 text: '/'
Token 2:  type: 3 text: ' '
Token 3:  type: 3 text: ' '

The comment rule is the only one not with all uppercase letters. See if it makes a difference naming it COMMENT. It's really weird that your lexer doesn't recognize the comments. See if you can get more details by stepping through the lexer simulator, with this input.

There must be something really special in your case. I have tried multiple grammars, all with comment lexer rules, and they all work as expected. Try also an existing grammar to see if that works at least.


tor.jer...@att.net

unread,
Sep 26, 2016, 11:54:37 AM9/26/16
to antlr-discussion

Mike,

I simplified the grammar a little bit and ran into an interesting situation.

If I use the following lexer  specification:

lexer grammar TestL;

options {
  language=Cpp;
}

NUMBER : ('0'..'9')+ ('.' ('0'..'9')+)? ;

COMMENT : '//' (' ' | 'a'..'z')* ('\r')? '\n' -> channel(HIDDEN) ;

WS :  (' ' | '\t' | '\r'| '\n') -> channel(HIDDEN) ;

STAR : '*';
DIV : '/';

on the input:

// this is a comment
   
5 / 3 * 5



It works. 

But if I change the (' ' | 'a'..'z')* back to .*? it fails as before.


Tor

tor.jer...@att.net

unread,
Sep 27, 2016, 7:28:54 PM9/27/16
to antlr-discussion
I think I have found the problem. In LexerATNSimulator.cpp:318, the method match is called as follows:

 if (trans->matches((int)t, std::numeric_limits<char32_t>::min(), std::numeric_limits<char32_t>::max())) {

Notice the last parameter, std::numeric_limits<char32_t>::max() - this will yield the max of an unsigned 32 bit variable. However, the signature of the code it calls in WildCardTransition.cpp:45 as shown below accepts parameters of type ssize_t.

bool WildcardTransition::matches(ssize_t symbol, ssize_t minVocabSymbol, ssize_t maxVocabSymbol)

In an x86 build this gets defined as int, which means that the max value passed in gets converted into a -1.

Thus the statement on line 46:

  return symbol >= minVocabSymbol && symbol <= maxVocabSymbol;

Will always fail. That's why wildcards never work in the new distro if you build for x86 (which I did for various legacy reasons).

Best regards,

Tor Jeremiassen

Mike Lischke

unread,
Sep 28, 2016, 3:12:18 AM9/28/16
to antlr-di...@googlegroups.com
Hi Tor,


I think I have found the problem. In LexerATNSimulator.cpp:318, the method match is called as follows:

 if (trans->matches((int)t, std::numeric_limits<char32_t>::min(), std::numeric_limits<char32_t>::max())) {

Hmm, indeed a tricky issue.


Notice the last parameter, std::numeric_limits<char32_t>::max() - this will yield the max of an unsigned 32 bit variable. However, the signature of the code it calls in WildCardTransition.cpp:45 as shown below accepts parameters of type ssize_t.

bool WildcardTransition::matches(ssize_t symbol, ssize_t minVocabSymbol, ssize_t maxVocabSymbol)

In an x86 build this gets defined as int, which means that the max value passed in gets converted into a -1.

Shouldn't that be the same for 64bit as well? Also there, a max unsigned is re-interpreted as min signed. What I wonder about however is:

- This code has been changed that way months ago. Why do you see a problem now?
- With all the warnings enabled in XCode, gcc/clang and VS (all at maximum), shouldn't there be a warning that an unsigned value is converted to a signed one implicitly? At least I got such warnings in other code places. Very weird.


Thus the statement on line 46:

  return symbol >= minVocabSymbol && symbol <= maxVocabSymbol;

Will always fail. That's why wildcards never work in the new distro if you build for x86 (which I did for various legacy reasons).

Have you tried to use numeric_limits<ssize_t>::max() instead? Does this work for you? In fact, this line tries to mimic the original Java code to an extend which is truly not necessary. We know it's trying to match the full allowed Unicode character range, which is fixed and will likely not change in the next years. So I believe this should be changed to matches(0, 0x10FFFF),


Loring Craymer

unread,
Sep 28, 2016, 5:07:28 AM9/28/16
to antlr-discussion
Mike--

I think that it would be more to the point to make sure that types are unsigned.  size_t (unsigned) seems more appropriate than ssize_t (signed) for the matches() method.  Java lacks unsigned int and unsigned long (by design--shudder), although some workarounds were introduced for Java 8.

--Loring

Mike Lischke

unread,
Sep 28, 2016, 5:14:16 AM9/28/16
to antlr-di...@googlegroups.com
> I think that it would be more to the point to make sure that types are unsigned. size_t (unsigned) seems more appropriate than ssize_t (signed) for the matches() method. Java lacks unsigned int and unsigned long (by design--shudder), although some workarounds were introduced for Java 8.

Loring, I'm absolutely with you on that. It's however quite some extra work to go from the omnipresent int in the Java runtime to proper signed/unsigned values in C++. You have to inspect every use case carefully. I did this already in many places, but didn't continue due to other work. Could well use some help with that...

Mike
--
www.soft-gems.net

tor.jer...@att.net

unread,
Sep 28, 2016, 7:08:25 PM9/28/16
to antlr-discussion
Mike,

In the short term, might it not be sufficient to change the signature of the "matches" methods to (int64_t, int64_t, int64_t). If I'm not wrong about it, the proper promotions should work from the 32 bit datatypes at the call sites to preserve the values and sign of the actual parameters.

Tor 

Loring Craymer

unread,
Sep 29, 2016, 2:36:00 AM9/29/16
to antlr-discussion
Mike--

I'm a bit busy, or I would offer to help.  I have done a lot of work on ANTLR internals, although I have not looked at ANTLR4--I have a variant of ANTLR derived from ANTLR3 for which I am currently debugging redundancy removal code for my linear GLL recognition algorithm.  When I get around to a C++ port, I will probably start with your code.

For this problem, I suggest making token types unsigned, starting with the AST token type field and access methods, then the generated token types, then doing the case-by-case which should take little thought--there are advantages to strong typing.  BitSet should also be edited to use unsigned--it is zero-based and heavily used in runtime code (at least pre-ANTLR4, and I can see no reason that that would have changed).  This should only take a day or two--once you get the code to compile again, it should pass testing after only a few minor tweaks.  At least that has been my experience in doing similar things with the ANTLR source.

--Loring

Mike Lischke

unread,
Sep 29, 2016, 3:23:56 AM9/29/16
to antlr-di...@googlegroups.com
>
> I'm a bit busy, or I would offer to help. I have done a lot of work on ANTLR internals, although I have not looked at ANTLR4--I have a variant of ANTLR derived from ANTLR3 for which I am currently debugging redundancy removal code for my linear GLL recognition algorithm. When I get around to a C++ port, I will probably start with your code.

Would be great, Loring. However, be prepared, the differences between ANTLR3 and 4 are non-trivial ;-)

>
> For this problem, I suggest making token types unsigned

That's one of the bigger problems. There is one negative token type: EOF. Only because of that the entire token type machinery must use signed types. It's a PITA. I hesitate to assign a different (positive) value to that, as it would break compatibility to all other targets. In other cases -1 is used to indicate non-initialized values (e.g. token positions) in otherwise always unsigned properties. I wonder if C++ users would accept a rule that says: whenever you see a -1 in Java ANTLR code use e.g. npos instead in C++?


> This should only take a day or two--once you get the code to compile again, it should pass testing after only a few minor tweaks. At least that has been my experience in doing similar things with the ANTLR source.

Unfortunately, the runtime test coverage is by far not 100%, so we might miss special cases like the one which led to this discussion, which you find only after weeks or even months. But the pure replacement is indeed not such a big deal. I'm more worried about breaking something.

Mike
--
www.soft-gems.net

Jim Idle

unread,
Sep 29, 2016, 3:51:17 AM9/29/16
to antlr-di...@googlegroups.com
On Thu, Sep 29, 2016 at 3:23 PM, 'Mike Lischke' via antlr-discussion <antlr-di...@googlegroups.com> wrote:
>
> I'm a bit busy, or I would offer to help.  I have done a lot of work on ANTLR internals, although I have not looked at ANTLR4--I have a variant of ANTLR derived from ANTLR3 for which I am currently debugging redundancy removal code for my linear GLL recognition algorithm.  When I get around to a C++ port, I will probably start with your code.

Would be great, Loring. However, be prepared, the differences between ANTLR3 and 4 are non-trivial ;-)

​:)
 

>
> For this problem, I suggest making token types unsigned

That's one of the bigger problems. There is one negative token type: EOF. Only because of that the entire token type machinery must use signed types. It's a PITA. I hesitate to assign a different (positive) value to that, as it would break compatibility to all other targets. In other cases -1 is used to indicate non-initialized values (e.g. token positions) in otherwise always unsigned properties. I wonder if C++ users would accept a rule that says: whenever you see a -1 in Java ANTLR code use e.g. npos instead in C++?

​I think it is totally fine to use different values for EOF in C++. There are no unsigned ints in Java, hence this type of thing happens. Personally, I think it would probably be OK to define EOF as token 0 and start real tokens after that. However, perhaps there are idioms in the source ​code as it stands that expect EOF to be negative? Shouldn't be really as it should be always be used symbolically without assuming anything about the value, but such things creep in. Maybe just try it and test it? 

I can't think of any time that token vocabs would be generated from say a Java lexer, then imported to a C++ parser, so I see no particular need to retain -1 as a token number. I could of course be missing something here ;)

I feel that you should take the C++ runtime towards "perfect C++", whatever that means, over time. Sometimes this is bound to mean that the internals of C++ are different to Java, but I think that is just implementation specific and we should not worry about it. 

 
>  This should only take a day or two--once you get the code to compile again, it should pass testing after only a few minor tweaks.  At least that has been my experience in doing similar things with the ANTLR source.

Unfortunately, the runtime test coverage is by far not 100%, so we might miss special cases like the one which led to this discussion, which you find only after weeks or even months. But the pure replacement is indeed not such a big deal. I'm more worried about breaking something.

​However, if we take the view that it is as tested as it can be given the tests available, then ​if you made this change and nothing broke in the tests, then asked current users to confirm it does not break their current grammars, then you are in no worse a situation as to knowledge of hidden edge case breakages as you are now. It's better to make a change like this earlier rather than later.

​I am tentatively looking forward to starting a project using the C++ target before too long. Hopefully, the project gets approved.

Jim

 

Mike
--
www.soft-gems.net

--
You received this message because you are subscribed to the Google Groups "antlr-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussion+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mike Lischke

unread,
Sep 29, 2016, 4:41:51 AM9/29/16
to antlr-di...@googlegroups.com
Hey Jim,


​I think it is totally fine to use different values for EOF in C++. There are no unsigned ints in Java, hence this type of thing happens. Personally, I think it would probably be OK to define EOF as token 0 and start real tokens after that. However, perhaps there are idioms in the source ​code as it stands that expect EOF to be negative? Shouldn't be really as it should be always be used symbolically without assuming anything about the value, but such things creep in. Maybe just try it and test it? 

Yes, you are probably right. I'm just leaning towards using a high value (max unsigned probably) instead of 0 for it, so we don't have to shift token values. We still have to compare between java and C++ sometimes.


I feel that you should take the C++ runtime towards "perfect C++", whatever that means, over time. Sometimes this is bound to mean that the internals of C++ are different to Java, but I think that is just implementation specific and we should not worry about it. 

Right, I see that the same way.


Loring Craymer

unread,
Sep 29, 2016, 8:10:40 AM9/29/16
to antlr-discussion
Actually, EOF can be coerced to be unsigned (0xffffffff), as can EPSILON.  It's not the value that matters, just that it be interpreted as unsigned.

The differences between ANTLR3 and ANTLR4 are pretty much irrelevant for my needs--the big changes will be to the recognizer machinery which I will discard anyway.  It's the things like file readers and other bits of infrastructure that I need, most of which fall under the dictum "If it ain't broke, don't fix it." and will have changed very little.

--Loring

Jim Idle

unread,
Sep 30, 2016, 3:44:14 AM9/30/16
to antlr-di...@googlegroups.com
On Thu, Sep 29, 2016 at 4:41 PM, 'Mike Lischke' via antlr-discussion <antlr-di...@googlegroups.com> wrote:
Hey Jim,


​I think it is totally fine to use different values for EOF in C++. There are no unsigned ints in Java, hence this type of thing happens. Personally, I think it would probably be OK to define EOF as token 0 and start real tokens after that. However, perhaps there are idioms in the source ​code as it stands that expect EOF to be negative? Shouldn't be really as it should be always be used symbolically without assuming anything about the value, but such things creep in. Maybe just try it and test it? 

Yes, you are probably right. I'm just leaning towards using a high value (max unsigned probably) instead of 0 for it, so we don't have to shift token values. We still have to compare between java and C++ sometimes.

​I initially thought that, but maxint can change over different architectures. These days that is probably unlikely, but that's why it is in limits.h of course. So I thought 0 might be better. You could always just ​make 1,000,000 I suppose and not care. The value isn't important.

Loring's right that it can be coerced of course, but better to avoid coersion if you can.

 


I feel that you should take the C++ runtime towards "perfect C++", whatever that means, over time. Sometimes this is bound to mean that the internals of C++ are different to Java, but I think that is just implementation specific and we should not worry about it. 

Right, I see that the same way.

​Thumbs up.

Loring Craymer

unread,
Sep 30, 2016, 7:43:38 AM9/30/16
to antlr-discussion
Actually, doing something like 
        unsigned int EOF = (unsigned) -1; 
is perfectly safe and will work anywhere.  The type cast does not affect the bit pattern (it would violate language standards if it did), just the value interpretation.  The type cast reinterprets -1 as max unsigned int.

--Loring
To unsubscribe from this group and stop receiving emails from it, send an email to antlr-discussi...@googlegroups.com.

Mike Lischke

unread,
Oct 3, 2016, 10:23:42 AM10/3/16
to antlr-di...@googlegroups.com
> Actually, doing something like
> unsigned int EOF = (unsigned) -1;
> is perfectly safe and will work anywhere. The type cast does not affect the bit pattern (it would violate language standards if it did), just the value interpretation. The type cast reinterprets -1 as max unsigned int.

I have now reworked many of the signed types to become unsigned ones (alts, rule indexes, EOF, EPSILON etc.). Interval and IntervalSet needed some extra handling and internally still work with signed types (in order to maintain the original sort order). This work is in Github now and you have to create a new antlr jar from this source repo and regenerate your parser with that to have everything in synch, if you wanna try this out.

Internal tests and the ANTLR runtime tests succeed still, so I'm quite confident the changes are fine.

Mike
--
www.soft-gems.net

Reply all
Reply to author
Forward
0 new messages