Can I disable the default lexer rules?


Sven Köhler

unread,
Jan 8, 2022, 7:33:35 PM1/8/22
to antlr-discussion
Hi,

first let me explain what I mean by "default lexer rules": I do not have a single skip rule in my lexer grammar. Instead, I have a rule like

WS: [\p{White_Space}]+ -> channel(HIDDEN);

This rule works: all spaces, tabs, \r, and \n end up as tokens on the HIDDEN channel.

Out of curiosity, I added a \x1B (ESC) character between two \n characters in the input file that I'm parsing. I cannot find this character in any of the tokens: I can see the two \n tokens, and there is nothing between them. If I instead add an ordinary letter (e.g. an 'a') to the file, the parser complains that it did not expect that character. Yet the parser does not complain about the \x1B character, which is to be expected, since it never sees a token for it.

My conclusion is that there is some sort of default behavior (rules?) that causes the lexer to skip the \x1B character.

What are those default rules that lead to the \x1B character being skipped? Are they documented anywhere? How can I disable them, so that I can be sure that every input character ends up in some token on some channel?

My motivation is that I want the input file to be fully reconstructable from the sequence of tokens.

I have already tried to extend my whitespace rule to

WS: [\u0000-\u0020\p{White_Space}]+ -> channel(HIDDEN);

This worked, and the \x1B character ended up in a token along with the surrounding \n characters. However, I cannot be sure that there are no other default rules that I have missed.

Unfortunately, I could not find anything on that in the documentation.


Kind Regards,
  Sven



Ken Domino

unread,
Jan 9, 2022, 7:16:35 PM1/9/22
to antlr-discussion
I tested this on Linux and Windows with the C#, Java, and Go targets (ANTLR 4.9.3), using the grammar "grammar Foobar; stuff : .*; WS: [\p{White_Space}]+ -> channel(HIDDEN);" and input as you describe (verified with "od -t x1 -t c"). All six platform/target combinations give a token recognition error for the ESC character. --Ken

Sven Köhler

unread,
Jan 10, 2022, 9:18:03 PM1/10/22
to antlr-di...@googlegroups.com

Hi Ken,

in my case, the target language is C++. I also see a token recognition error printed in the terminal, but the parser doesn't report any error. Why is the token recognition error printed in the terminal but ultimately ignored by the parser?

Do I need some kind of catch-all lexer rule? For now, I have added CATCH_ALL: .+?; to the end of my lexer grammar, and now the parser reports an error.
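For reference, a minimal lexer grammar with such a catch-all as the very last rule might look like this (a sketch, not taken from the thread; a plain '.' also works here, since as the last rule it only applies to characters no earlier rule matches):

```antlr
// Sketch only: CATCH_ALL as the LAST lexer rule, so it fires only for
// characters that no earlier rule can match. Matching a single '.' per
// token keeps error positions precise.
lexer grammar CatchAllExample;
WS        : [\p{White_Space}]+ -> channel(HIDDEN);
KW_DO     : 'do';
CATCH_ALL : . ;
```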

Kind Regards,
  Sven


Ken Domino

unread,
Jan 11, 2022, 7:06:23 PM1/11/22
to antlr-discussion
You are right! It does not work for Cpp. Parsers should operate consistently across targets. This is a bug with the Antlr Cpp target (on Ubuntu/g++). I will debug to find out why it's not working.

Mike Lischke

unread,
Jan 12, 2022, 3:14:34 AM1/12/22
to antlr-di...@googlegroups.com
Hi,

I have read this three times now and am still not 100% sure what the problem is. Sven, can you provide a grammar and sample input, and list what you expect versus what you get instead? Is it that an escape sequence is not recognised in a C++ lexer?



Sven Köhler

unread,
Jan 12, 2022, 6:44:17 PM1/12/22
to antlr-di...@googlegroups.com

Hi,

first of all, let me say that my original conclusion was completely wrong. There are no default lexer rules. I simply missed that a token recognition error is printed and then ignored. So the lexer implicitly skips input that it does not know how to handle, and the parser works on the remaining tokens.

Here's an example grammar:

lexer grammar TestLexer;
options {
    language = Cpp;
}
WHITESPACE: [\p{White_Space}]+ -> channel(HIDDEN);
KW_DO: 'do';

parser grammar TestParser;
options {
    language = Cpp;
    tokenVocab = TestLexer;
}
root: (KW_DO)* EOF;

The generated C++ parser will not report an error when parsing the input "do\n\x1B\ndo". The lexer merely prints a message about the token recognition error.

The character \x1B is the ASCII "Escape" control character (which is not classified as whitespace). I simply played around with characters that are not part of my grammar, and this was the first one that produced some sort of error.

You can also use the string "abc" as input. You will get three token recognition errors, yet parsing will still finish without an issue.


Kind Regards,
  Sven



Ken Domino

unread,
Jan 12, 2022, 8:21:36 PM1/12/22
to antlr-discussion
Mike, sorry, I didn't get much sleep last night. I rechecked my analysis and found a mistake: the generated parsers do behave consistently across the Cpp, CSharp, Java, and Go targets. ANTLR is rock solid here.

Sven, that is correct behavior. The lexer and parser are two different recognizers. In other words, you can get a lexer error and still have the parser report no error, because no token was generated for the invalid character. You just have to test for errors in both recognizers, not only the parser. For trgen, I generate driver code that checks both, e.g., see https://github.com/kaby76/Domemtech.Trash/blob/99ed997500ff82b9a389615633dbdceef2621792/trgen/templates/Cpp/Program.cpp#L92
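A per-recognizer error count of the kind Ken describes can be sketched with a custom listener against the ANTLR4 C++ runtime (an illustration only; the class name CountingErrorListener is invented, and the snippet requires the runtime headers plus a generated lexer/parser to compile):

```cpp
// Sketch: one listener counts syntax errors from both recognizers.
// Assumes the ANTLR4 C++ runtime; CountingErrorListener is a made-up name.
#include "antlr4-runtime.h"

class CountingErrorListener : public antlr4::BaseErrorListener {
public:
    size_t errorCount = 0;

    void syntaxError(antlr4::Recognizer * /*recognizer*/,
                     antlr4::Token * /*offendingSymbol*/,
                     size_t /*line*/, size_t /*charPositionInLine*/,
                     const std::string & /*msg*/,
                     std::exception_ptr /*e*/) override {
        ++errorCount;  // counts lexer and parser errors alike
    }
};

// Usage sketch: attach the same listener to lexer and parser, then
// treat the parse as failed if errorCount != 0 after parsing.
//   CountingErrorListener listener;
//   lexer.removeErrorListeners();
//   lexer.addErrorListener(&listener);
//   parser.removeErrorListeners();
//   parser.addErrorListener(&listener);
```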

--Ken

Sven Köhler

unread,
Jan 13, 2022, 4:01:32 PM1/13/22
to antlr-di...@googlegroups.com

Hi Ken,

thanks for the confirmation that this is intended behavior.

What is the use of a lexer that continues to tokenize the input even though it has encountered unexpected input?

What would be a rule that best catches unexpected input? (See my other email for an attempt at a catch-all rule.)


Kind Regards,
  Sven



Ken Domino

unread,
Jan 14, 2022, 6:13:07 AM1/14/22
to antlr-discussion
Sven, I'm not sure which applications would use only a lexer, but good APIs allow things to be composed in ways not thought of beforehand, rather than being monolithic. Even for the ISO C++ preprocessor that I am writing, a parser is needed (e.g., for the '#if' expressions), because the spec requires it.

I always define my own error listeners for the lexer and the parser in order to keep track of the number of errors reported by each, because I've run into the same issue as you. If you get a lexer error, it seems like it should be reported as part of the overall failure of the parse, but it isn't: ConsoleErrorListener does not retain a count of errors (https://github.com/antlr/antlr4/blob/master/runtime/CSharp/src/ConsoleErrorListener.cs), Parser keeps track of the number of errors reported independently of the error listener (https://github.com/antlr/antlr4/blob/1b144fa7b40f6d1177c9e4f400a6a04f4103d02e/runtime/CSharp/src/Parser.cs#L680), and Lexer has no similar error counter (https://github.com/antlr/antlr4/blob/1b144fa7b40f6d1177c9e4f400a6a04f4103d02e/runtime/CSharp/src/Lexer.cs#L558).

I think your CATCH_ALL rule is fine. --Ken

Mike Lischke

unread,
Jan 14, 2022, 7:49:39 AM1/14/22
to ANTLR discussion group




To add a few more points here:

1. The basic idea with lexer errors is that they should not appear separately, but surface as parser errors.
2. Sometimes it is still useful to get lexer errors, to provide better error messages. The standard errors often only say "I'm missing this token, I expected that token", but you can also tell the user, for instance, that a string is not properly terminated, etc.
3. To re-sync the parser after a single missing or extra token, the lexer must return all recognised tokens. Re-syncing is needed if the parser is to continue and report further errors. In Delphi, the parser stopped at the very first error; you could fix it, and if there was another error, that was then reported. But these days parsers report all the errors they can find (usually up to a maximum number).
4. Ken, I added an error counter for lexer errors in the C++ target, because I needed it and I think it's of general use.
5. Sven, a catch-all rule as the last rule is perfectly fine for collecting input for later reporting. The resulting token is not expected anywhere in the parser, so parser error reporting is not broken by it.
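Putting the points above together, a driver for Sven's grammar could check both recognizers after parsing (a sketch only; it assumes the generated TestLexer/TestParser, the ANTLR4 C++ runtime, and that the lexer error counter from point 4 is exposed as getNumberOfSyntaxErrors(), mirroring the parser's method):

```cpp
// Sketch: treat the parse as failed if EITHER recognizer saw an error.
// Assumes ANTLR4 C++ runtime and TestLexer/TestParser generated from
// Sven's grammar; lexer.getNumberOfSyntaxErrors() is the counter from
// point 4 and may not exist in older runtime versions.
#include "antlr4-runtime.h"
#include "TestLexer.h"
#include "TestParser.h"

int main() {
    antlr4::ANTLRInputStream input("do\n\x1B\ndo");
    TestLexer lexer(&input);
    antlr4::CommonTokenStream tokens(&lexer);
    TestParser parser(&tokens);
    parser.root();

    bool ok = lexer.getNumberOfSyntaxErrors() == 0
           && parser.getNumberOfSyntaxErrors() == 0;
    return ok ? 0 : 1;  // non-zero exit for the \x1B lexer error
}
```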

Regards,

