Hi, does anyone know if there is a way to get the lexer for an antlr4 grammar, say T.g4, to make the stop index at least one greater than the start index for matched (unicode) tokens such as
𝒩 (MATHEMATICAL SCRIPT CAPITAL N, U+1D4A9)? Right now startIndex=stopIndex for these sorts of things.
It's technically one code point but two UTF-16 code units, right?: \uD835\uDCA9 (not really terribly knowledgeable when it comes to unicode).
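For what it's worth, plain Java (no ANTLR involved) does report 𝒩 as one code point occupying two UTF-16 chars; a quick sanity check:

```java
public class SurrogateCheck {
    public static void main(String[] args) {
        String n = "\uD835\uDCA9"; // 𝒩, U+1D4A9, a surrogate pair in UTF-16
        System.out.println(n.length());                      // 2 UTF-16 code units
        System.out.println(n.codePointCount(0, n.length())); // 1 code point
        System.out.println(Character.charCount(0x1D4A9));    // 2 chars needed
    }
}
```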
Here's the lexer rule I'm using:
MATH_UNICODE_SYM : U_ARROW | U_OPERATOR | U_MATHSCRIPT;
//will add to this eventually
fragment
U_MATHSCRIPT : ('\u{1D49C}' | '\u{1D49E}' | '\u{1D4A9}'); //should be ('𝒜' | '𝒞' | '𝒩')
For background: I'm building a JetBrains IDE plugin that involves validating files in my language, and I'm running into an issue where highlighting errors for tokens involving supplementary-plane unicode symbols such as the one above screws up the offsets on
errors later in the file. See the attached sequence of pics.
For reference, a snippet of the IDEA code where the issue starts is below:
//Issue is an object that comes from my language and contains an offendingToken. This method gets called for each Issue the tool produces.
public void highlightIssueInEditor(Editor e, Issue issue) {
    final TextAttributes attr = new TextAttributes();
    Token offendingToken = issue.msg.offendingToken;
    int a = offendingToken.getStartIndex();
    int b = offendingToken.getStopIndex() + 1;
    if (issue instanceof Error) {
        //set attr to boxed red
        ...
    }
    else if (issue instanceof Warning) {
        //set attr to boxed orange
        ...
    }
    RangeHighlighter highlighter =
        markupModel.addRangeHighlighter(a, b,
            HighlighterLayer.ERROR, attr, HighlighterTargetArea.EXACT_RANGE);
    ...//add the highlighter to the editor's markup model
}
In short, it highlights correctly if I increment 'b' by 2 for these special unicode symbols, but then the start (and stop) indices of every token appearing later in the file would also need to be adjusted to account for any additional space I add.
Seems like this would get finicky and messy real fast--hence it would be useful to know whether it's possible to address the issue by changing the tokens at the source.
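If the lexer itself can't be changed, the alternative I'm considering is translating the indices to UTF-16 char offsets before handing them to the markup model, rather than patching individual tokens. A minimal sketch with the plain JDK (the helper name is mine, and I'm assuming the token indices count code points into the file):

```java
public class OffsetFix {
    // Hypothetical helper: map a code-point index (what the token appears to
    // report) to a UTF-16 char offset (what the editor's markup model expects).
    // 'text' is assumed to be the full document text the tokens were lexed from.
    static int toCharOffset(String text, int codePointIndex) {
        return text.offsetByCodePoints(0, codePointIndex);
    }

    public static void main(String[] args) {
        String text = "x\uD835\uDCA9y"; // "x𝒩y": 3 code points, 4 chars
        System.out.println(toCharOffset(text, 1)); // 1 (start of 𝒩)
        System.out.println(toCharOffset(text, 2)); // 3 (start of y, past the surrogate pair)
    }
}
```

With something like that, 'a' and 'b' above would become toCharOffset(docText, offendingToken.getStartIndex()) and toCharOffset(docText, offendingToken.getStopIndex() + 1), with docText taken from e.getDocument().getText().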
I've taken a look at the unicode support added in antlr 4.7 (in the docs section) and I'm currently using
CharStream afs = CharStreams.fromFileName(file.getAbsolutePath()); to retrieve the char stream, though start=stop for these tokens right now.
Is this do-able or am I barking up the wrong tree?