question about start and stop indices with unicode symbols


Dan Welch

Feb 17, 2018, 9:26:38 PM
to antlr-discussion
Hi, does anyone know if there is a way to get the lexer for an ANTLR 4 grammar, say T.g4, to make the stop index at least one greater than the start index for matched tokens outside the Basic Multilingual Plane, such as
𝒩 (mathematical script capital N, U+1D4A9)? Right now startIndex = stopIndex for these tokens.

It's technically two UTF-16 code units (a surrogate pair), right?: \uD835\uDCA9 (I'm not terribly knowledgeable when it comes to Unicode).
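To make the code point versus code unit distinction concrete, here is a small Java sketch. The class and method names (SurrogateDemo, codeUnits, codePoints) are made up for illustration; only standard java.lang calls are used. It shows that 𝒩 is one code point but occupies two Java chars:

```java
public class SurrogateDemo {
    /** Number of UTF-16 code units (Java chars) in s. */
    public static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in s. */
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Build the string from the code point U+1D4A9 (script capital N).
        String n = new String(Character.toChars(0x1D4A9));
        System.out.println(codePoints(n)); // prints 1
        System.out.println(codeUnits(n));  // prints 2
        // The two code units are the high and low surrogates.
        System.out.printf("%04X %04X%n", (int) n.charAt(0), (int) n.charAt(1)); // prints D835 DCA9
    }
}
```

So an index that counts code points sees one position for 𝒩, while an index that counts Java chars sees two.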

Here's the lexer rule I'm using:
MATH_UNICODE_SYM : U_ARROW | U_OPERATOR | U_MATHSCRIPT;

// will add to this eventually
fragment
U_MATHSCRIPT : ('\u{1D49C}' | '\u{1D49E}' | '\u{1D4A9}') ;   // should be ('𝒜' | '𝒞' | '𝒩')

For background: I'm building a JetBrains IDE plugin that validates files in my language, and I'm running into an issue where highlighting errors for tokens involving large-valued Unicode symbols (code points above U+FFFF) such as the above throws off the offsets of
errors later in the file. See the attached sequence of pics.

For reference, a snippet of the IDEA code where the issue starts is below: 

// Issue is an object that comes from my language and contains an offendingToken.
// This method gets called for each Issue the tool produces.
public void highlightIssueInEditor(Editor e, Issue issue) {
    final TextAttributes attr = new TextAttributes();
    Token offendingToken = issue.msg.offendingToken;
    int a = offendingToken.getStartIndex();
    int b = offendingToken.getStopIndex() + 1;

    if (issue instanceof Error) {
        // set attr to boxed red
        ...
    }
    else if (issue instanceof Warning) {
        // set attr to boxed orange
        ...
    }
    RangeHighlighter highlighter = 
        markupModel.addRangeHighlighter(a, b, 
                        HighlighterLayer.ERROR, attr, HighlighterTargetArea.EXACT_RANGE);

    ...//add the highlighter to the editor's markup model
}

In short, it highlights correctly if I increment 'b' by 2 for these special Unicode symbols, but tokens appearing later in the file would then be off, since their start and stop indices would also need to be adjusted to account for any additional offset I add.

Seems like this would get finicky and messy real fast, so it would be useful to know whether it's possible to address the issue by changing the tokens at the source.

I've taken a look at the Unicode material added for ANTLR 4.7 in the docs section, and I'm currently using
CharStream afs = CharStreams.fromFileName(file.getAbsolutePath());  to retrieve the char stream, but start = stop for these tokens right now.
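One possible plugin-side workaround is sketched below. It assumes that the 4.7 stream reports token indices in code points rather than UTF-16 code units, which would explain start = stop here but is worth verifying. The OffsetConverter class and toUtf16Offset method are hypothetical names; the only library call is the standard String.offsetByCodePoints. Each token index is converted to an editor offset on its own, so no running adjustment has to be carried through the file:

```java
public class OffsetConverter {
    /**
     * Converts an index counted in code points into an offset counted in
     * UTF-16 code units (Java chars), relative to the start of text.
     */
    public static int toUtf16Offset(String text, int codePointIndex) {
        return text.offsetByCodePoints(0, codePointIndex);
    }

    public static void main(String[] args) {
        String text = "Def x : " + new String(Character.toChars(0x1D4A9)) + " ;";

        // Assumed code point indices for the script N token (start == stop).
        int start = 8;
        int stop = 8;

        int a = toUtf16Offset(text, start);    // 8
        int b = toUtf16Offset(text, stop + 1); // 10, the range covers both surrogates

        // A later token is shifted automatically: ';' is code point 10, char offset 11.
        int semi = toUtf16Offset(text, 10);

        System.out.println(a + " " + b + " " + semi); // prints 8 10 11
    }
}
```

In the highlighting method, 'a' and 'b' would then come from this conversion applied to the document text instead of directly from getStartIndex() and getStopIndex() + 1.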

Is this do-able or am I barking up the wrong tree?

01.png
02.png
03.png

Dan Welch

Feb 18, 2018, 1:32:55 AM
to antlr-discussion
Hmm, I just checked this out with 4.6, and that version seems to do what I expected. For instance, I type

Def x : 𝒩

then print out the tokens, and I see:
...
[@2,4:5='𝒩',<2>,1:4]

Notice the start is not equal to the stop.