Hi Ken,
in my case, the target language is C++. I also see a token recognition error being printed in the terminal. But the parser doesn't report any error. Why is the token recognition error printed on the terminal but ultimately ignored by the parser?
Do I need some kind of catch-all lexer rule? For now I have added CATCH_ALL: .+?; to the end of my lexer grammar, and now the parser reports an error.
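A side note on that rule (a sketch; the rule name ERROR_CHAR is my own choice, not from this thread): in an ANTLR lexer, a non-greedy loop like .+? with nothing following it inside the rule stops after a single character, so CATCH_ALL effectively matches one character at a time anyway. The commonly suggested fallback rule, placed as the very last rule of the lexer grammar, makes that explicit:

```antlr
// Must be the LAST rule in the lexer grammar: it matches any single
// character that no earlier rule matched, so no input is silently dropped.
ERROR_CHAR: . ;
```

Because each stray character then becomes its own token, the error is reported by the parser rather than only printed by the lexer, which matches the behavior observed after adding CATCH_ALL.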
Kind Regards,
Sven
--
You received this message because you are subscribed to a topic in the Google Groups "antlr-discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/antlr-discussion/GwZs-Gyb3l0/unsubscribe.
To unsubscribe from this group and all its topics, send an email to antlr-discussi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/antlr-discussion/8b368df5-d6cc-45a2-b552-2a6d1da8a669n%40googlegroups.com.
On 12.01.2022 at 01:06, Ken Domino <ken.d...@gmail.com> wrote:

You are right! It does not work for Cpp. Parsers should operate consistently across targets. This is a bug with the ANTLR Cpp target (on Ubuntu/g++). I will debug to find out why it's not working.
On 10.01.22 at 01:16, Ken Domino wrote:
I tested this on Linux and Windows with the C#, Java, and Go targets (ANTLR 4.9.3), using the grammar "grammar Foobar; stuff : .*; WS: [\p{White_Space}]+ -> channel(HIDDEN);" and input as you describe (checked with "od -t x1 -t c"). It gives a token recognition error for the ESC on all six test combinations. --Ken
On Saturday, January 8, 2022 at 7:33:35 PM UTC-5 sven.k...@gmail.com wrote:
Hi,
first let me state what I mean by "default lexer rules". I do not have a single skip rule in my lexer grammar. Instead, I have a rule like
WS: [\p{White_Space}]+ -> channel(HIDDEN);
This rule works. For example all spaces, tabs, \r, and \n end up as tokens in the HIDDEN channel.
For the fun of it, I added a \x1B (ESC) character between two \n characters in the input file that I'm parsing. I cannot find this character in any of the tokens. In fact, I can see the two \n tokens and there is nothing in between them. Obviously, if I add an ordinary letter (e.g. an 'a') to my file, the parser complains that it did not expect that character. Yet the parser does not complain about the \x1B character. That is to be expected, as the parser never sees a token for it.
My conclusion is that there is some sort of default behavior (rules?) that causes the \x1B character to be skipped by the lexer.
What are those default rules that lead to the \x1B character being skipped? Are they documented anywhere? How can I disable them, so that I can be sure that every input character ends up in some token on some channel?
My motivation is that I want the input file to be fully reconstructable from the sequence of tokens.
I have already tried extending my whitespace rule to
WS: [\u0000-\u0020\p{White_Space}]+ -> channel(HIDDEN);
This worked, and the \x1B character ended up in a token along with the surrounding \n characters. However, I cannot be sure that there aren't more default rules that I missed.
Hi,
first of all, let me say that my original conclusion was completely wrong. There are no default lexer rules. I simply missed that a token recognition error is printed and ignored. So the lexer implicitly skips input that it does not know how to handle, and the parser works on the remaining tokens.
Here's an example grammar:
lexer grammar TestLexer;

options {
    language = Cpp;
}

WHITESPACE: [\p{White_Space}]+ -> channel(HIDDEN);
KW_DO: 'do';

parser grammar TestParser;

options {
    language = Cpp;
    tokenVocab = TestLexer;
}

root: (KW_DO)* EOF;
The generated C++ parser will not report an error when parsing the input "do\n\x1B\ndo". The lexer merely prints a message about the token recognition error.
The character \x1B in the ASCII table is called "Escape" (a control character, which is not classified as whitespace). I simply played around with characters that were not part of my grammar, and this was the first one that yielded some sort of error.
You can also use the string "abc" as an input. You will get three token recognition errors, and parsing will still finish without an issue.
Kind Regards,
Sven
Hi Ken,
thanks for the confirmation that this is intended behavior.
What is the application of a lexer that continues to tokenize the input even though it encountered unexpected input?
What would be a rule that best catches unexpected input? (See my other email for an attempt at a catch-all rule.)
Kind Regards,
Sven
On 14.01.2022 at 12:13, Ken Domino <ken.d...@gmail.com> wrote:

Sven, I'm not sure what applications would have only a lexer, but good APIs allow things to be composed in ways unthought of, not monolithic. Even with the ISO C++XX preprocessor that I am writing, a parser is needed (e.g., for the '#if' expressions) because it's per spec. I always define my own error listeners for the lexer and parser in order to keep track of the number of errors reported in each, because I've encountered the same issue as you. If one gets a lexer error, it seems like it should be reported as part of the overall failure in the parse, but it isn't. ConsoleErrorListener does not retain a count of errors (https://github.com/antlr/antlr4/blob/master/runtime/CSharp/src/ConsoleErrorListener.cs); Parser keeps track of the number of errors reported independently of the error listener (https://github.com/antlr/antlr4/blob/1b144fa7b40f6d1177c9e4f400a6a04f4103d02e/runtime/CSharp/src/Parser.cs#L680); and Lexer does not have a similar error counter (https://github.com/antlr/antlr4/blob/1b144fa7b40f6d1177c9e4f400a6a04f4103d02e/runtime/CSharp/src/Lexer.cs#L558). I think your CATCH_ALL rule is fine. --Ken
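The error-counting listener Ken describes can be sketched in C++. This is a minimal, self-contained illustration of the counting idea only: the ErrorListener interface below is a simplified stand-in for the runtime's antlr4::BaseErrorListener (whose real syntaxError signature also receives the recognizer, the offending symbol, and the exception pointer), and CountingErrorListener and demo are hypothetical names of mine.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <string>

// Simplified stand-in for antlr4::BaseErrorListener; the real interface
// has more parameters, but the counting pattern is the same.
struct ErrorListener {
    virtual void syntaxError(std::size_t line, std::size_t charPositionInLine,
                             const std::string &msg) = 0;
    virtual ~ErrorListener() = default;
};

// Listener that keeps the error count which ConsoleErrorListener lacks.
class CountingErrorListener : public ErrorListener {
public:
    void syntaxError(std::size_t line, std::size_t charPositionInLine,
                     const std::string &msg) override {
        ++count_;
        std::cerr << "line " << line << ":" << charPositionInLine
                  << " " << msg << "\n";
    }
    std::size_t errorCount() const { return count_; }

private:
    std::size_t count_ = 0;
};

// Demo: simulate the lexer reporting one "token recognition error" (as it
// would for the stray \x1B) and return the recorded count.
std::size_t demo() {
    CountingErrorListener lexerErrors;
    // With the real runtime this would be wired up as:
    //   lexer.removeErrorListeners();
    //   lexer.addErrorListener(&lexerErrors);
    lexerErrors.syntaxError(2, 0, "token recognition error at: '\\x1B'");
    return lexerErrors.errorCount();
}
```

With the real runtime, one would attach such a listener to both the lexer and the parser via removeErrorListeners()/addErrorListener(), and have the driver treat a nonzero lexer-error count as part of the overall parse failure.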