antlr4-cpp: wrong line within BaseErrorListener::syntaxError() with unicode line breaks, e.g. \u2029

Jan Krause

Nov 16, 2016, 9:20:53 AM

to antlr-discussion

at first:

thank you very very much for the cpp runtime of antlr4... it is great!

I have a problem with error handling for text files that contain Unicode line breaks (e.g. \u2029). Within the syntaxError function of my error listener (based on BaseErrorListener) I always get line 1 as the line number of the syntax error. It seems that the ANTLR runtime cannot deal with Unicode line breaks?

----------------------------------

ExampleGrammar.g4:

grammar ExampleGrammar;

hello
   : 'HELLO'
   ;

WS
   : [ \r\n\t\u2029]+ -> channel(HIDDEN)
   ;

---------------------

If I try to parse this text:

"\u2029 \u2029 what's up!"

I get a syntax error at line 1, charPositionInLine 4, but the correct position would be line 3, charPositionInLine 1.

What do I have to do?

cheers

Jan

Mike Lischke

Nov 16, 2016, 10:03:39 AM

to antlr-di...@googlegroups.com

Hi Jan,

> at first:
> thank you very very much for the cpp runtime of antlr4... it is great!

Thanks :-)

>
> I have a problem with error handling for text files that contain Unicode line breaks (e.g. \u2029). Within the syntaxError function of my error listener (based on BaseErrorListener) I always get line 1 as the line number of the syntax error. It seems that the ANTLR runtime cannot deal with Unicode line breaks?

Precisely. When you look in LexerATNSimulator.cpp line 624 (consume() method) you can see it only checks for a simple \n. Try adding the Unicode line break there too and see if that works out for you. If it does we can add it to the official C++ target (and also add more separators like page break, vertical separator etc.) to complete the Unicode support.
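
To illustrate why the reported position stays on line 1, here is a minimal standalone sketch of position tracking that advances the line only on '\n'. It is not the runtime's code: the Position struct and the positionNewlineOnly function are hypothetical names made up for this example, and the input is assumed to be already decoded into code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the lexer's position state; not an ANTLR type.
struct Position {
  std::size_t line = 1;                // 1-based, as reported to syntaxError
  std::size_t charPositionInLine = 0;  // 0-based column
};

// Returns the position of the character at index `stop`, starting a new
// line only when a plain '\n' has been consumed.
Position positionNewlineOnly(const std::u32string &text, std::size_t stop) {
  Position pos;
  for (std::size_t i = 0; i < stop && i < text.size(); ++i) {
    if (text[i] == U'\n') {
      pos.line++;
      pos.charPositionInLine = 0;
    } else {
      // Every other code point, including U+2029, just advances the column.
      pos.charPositionInLine++;
    }
  }
  return pos;
}
```

For the input from the first post, the 'w' sits at index 4, and this sketch places it at line 1, charPositionInLine 4, which matches the reported error position. Replacing both \u2029 characters with '\n' yields line 3, charPositionInLine 1.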

Mike
--
www.soft-gems.net

Jan Krause

Nov 17, 2016, 2:15:30 AM

to antlr-discussion

Hi Mike,

thanks for your fast response, and yes, it works now! My LexerATNSimulator::consume() is now:

void LexerATNSimulator::consume(CharStream *input) {
  ssize_t curChar = input->LA(1);
  // Start a new line on '\n' and on the Unicode line breaks listed below.
  if ((curChar == '\n')
      || (curChar == 0x2028 /*'\u2028'*/)
      || (curChar == 0x2029 /*'\u2029'*/)
      || (curChar == 0x0085 /*'\u0085'*/)
      || (curChar == 0x000C /*'\u000C'*/)) {
    _line++;
    _charPositionInLine = 0;
  } else {
    _charPositionInLine++;
  }
  input->consume();
}

I don't know whether these are all the relevant Unicode line breaks; string stuff is not really my domain. But anyway, thank you very much for the hint!
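
On the question of which characters count: the newline guidelines in the Unicode Standard (section 5.8) list LF (U+000A), VT (U+000B), FF (U+000C), CR (U+000D), NEL (U+0085), LS (U+2028) and PS (U+2029), with CR followed by LF treated as a single break. The following is a minimal standalone sketch under those assumptions, not part of the ANTLR runtime; isUnicodeLineBreak, LineColumn and locate are hypothetical names chosen for this example.

```cpp
#include <cstddef>
#include <string>

// True for the code points the Unicode newline guidelines treat as line
// boundaries. Whether a lexer should honor all of them is a design choice.
bool isUnicodeLineBreak(char32_t c) {
  switch (c) {
    case 0x000A:  // LF
    case 0x000B:  // VT
    case 0x000C:  // FF
    case 0x000D:  // CR
    case 0x0085:  // NEL
    case 0x2028:  // LINE SEPARATOR
    case 0x2029:  // PARAGRAPH SEPARATOR
      return true;
    default:
      return false;
  }
}

// Hypothetical position record: 1-based line, 0-based column.
struct LineColumn {
  std::size_t line = 1;
  std::size_t column = 0;
};

// Returns the position of the character at index `stop`.
LineColumn locate(const std::u32string &text, std::size_t stop) {
  LineColumn pos;
  for (std::size_t i = 0; i < stop && i < text.size(); ++i) {
    char32_t c = text[i];
    if (c == 0x000D && i + 1 < text.size() && text[i + 1] == 0x000A) {
      // CR directly before LF: skip it so the pair counts as one break;
      // the LF on the next iteration advances the line.
      continue;
    }
    if (isUnicodeLineBreak(c)) {
      pos.line++;
      pos.column = 0;
    } else {
      pos.column++;
    }
  }
  return pos;
}
```

With the input from the first post, locate(U"\u2029 \u2029 what's up!", 4) gives line 3, column 1, the position expected for the 'w'. Note that this set is wider than the one in the patch above, which does not treat VT or CR as breaks.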

Jan

Jim Idle

Nov 17, 2016, 3:32:04 AM

to antlr-discussion

In the C runtime I added a runtime API to set this. Might be worth a thought?


Mike Lischke

Nov 17, 2016, 3:43:13 AM

to antlr-di...@googlegroups.com

>
> In the C runtime I added a runtime API to set this. Might be worth a thought?

Hm, can you give me a pointer where I can find this? I don't remember having seen such an API.

Mike
--
www.soft-gems.net
