Parsing Triple Quote Delimited Multi-Line Strings with Trailing Quotes?


Michael Steindorfer

Mar 23, 2018, 4:33:16 AM
to antlr-discussion
Hi all,

I'm trying to parse multi-line string literals that start and end with triple double quotes `"""` (similar to those found in languages like Kotlin and Scala). The string literals should allow up to two consecutive (unescaped) double quotes anywhere in the string body, as in:

  """
  Hello ""World""!
  """

Below I'll paste a simplified grammar that comes pretty close to the requirement above; however, it doesn't work with trailing double quotes directly before the closing token `"""`. Some examples:

  """""
  Additional double quotes in the beginning SUCCEED.
  """ 

  """
  Additional double quotes in the body "" or at the end of a line SUCCEED.""
  """

But:

  """
  Additional double quotes before the closing token FAIL!
  """""


Note that the multi-line strings are not purely lexical, since the full grammar also supports string interpolation.
Therefore, I would appreciate feedback that helps me support up to two consecutive trailing double quotes within the grammar structure that is already in place (i.e., where multi-line strings are defined in both the lexer and the parser grammar).

I appreciate your help.

Best regards,
Michael


---- ExampleLexer.g4 ----

lexer grammar ExampleLexer;

channels {
  WHITESPACE,
  COMMENTS
}

WS
  : [ \t\r\n\f]+ -> channel(WHITESPACE)
  ;

TRIPLE_QUOTE
  : '"""'  -> pushMode(MultiLineString)
  ;

mode MultiLineString;

END_TRIPLE_QUOTE
  : '"""' -> popMode
  ;

MLStringChars
  : (MLUnescapedDoubleQuotes? ~["\\$\r\n])+
  ;
MLNewline
  : '\r' '\n'? | '\n'
  ;
MLUnescapedDoubleQuotes
  : '"' '"'?
  ;


---- ExampleParser.g4 ----

parser grammar ExampleParser;

options {
  tokenVocab = ExampleLexer;
}

stringLiteral
  : TRIPLE_QUOTE multiLinePart* END_TRIPLE_QUOTE
  ;

multiLinePart
  : (ts+=MLStringChars | ts+=MLNewline | ts+=MLUnescapedDoubleQuotes)+
  ;

Loring Craymer

Mar 25, 2018, 7:14:09 AM
to antlr-discussion
The trick here is to modify END_TRIPLE_QUOTE to consume up to two additional double quotes (so it matches 3, 4, or 5), then adjust later.

--Loring
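
A minimal sketch of what this suggestion might look like when applied to the lexer grammar posted above. The exact rule shape is an assumption, since no code was posted with the suggestion:

```antlr
// In mode MultiLineString (sketch, not taken from the original posts):
// the closing token also swallows up to two quotes that directly precede
// the delimiter, so it matches 3, 4, or 5 consecutive double quotes.
END_TRIPLE_QUOTE
  : '"""' '"'? '"'? -> popMode
  ;
```

Because the ANTLR lexer prefers the longest match, a run of five quotes should then become a single END_TRIPLE_QUOTE token rather than MLUnescapedDoubleQuotes followed by the closer. The "adjust later" step would be to treat everything except the last three characters of the token text as string content when building the tree.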

Michael Steindorfer

Mar 28, 2018, 8:42:13 AM
to antlr-discussion
Thanks Loring, your suggestion was helpful and I got it to work.

However, as a side effect I had to add quite a bit of code to the AST builder to account for the optional trailing quotes being part of the END_TRIPLE_QUOTE token.

Would it be possible to define a lexer action that prevents END_TRIPLE_QUOTE from being recognised if it is followed by another double quote?
If so, the trailing double quotes could be accepted by `multiLinePart` and the AST builder would become simpler again.
Unfortunately, my attempts so far at writing a lexer action for this problem didn't convince ANTLR.

Thanks for your help.

-- Michael
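
For illustration, the kind of check asked about here is usually written as a semantic predicate rather than an action. The following is an untested sketch that assumes the Java target (`_input.LA(1)` is target-specific code and does not come from the original posts):

```antlr
// In mode MultiLineString (sketch, Java target assumed):
// accept the closing delimiter only when the next character is not
// another double quote; otherwise this rule is rejected at this position
// and the lexer falls back to the other rules of the mode.
END_TRIPLE_QUOTE
  : '"""' {_input.LA(1) != '"'}? -> popMode
  ;
```

With this guard, a run of five quotes would be expected to lex as MLUnescapedDoubleQuotes (two quotes) followed by END_TRIPLE_QUOTE. A run of exactly four quotes would presumably still fail, because MLUnescapedDoubleQuotes greedily takes two quotes and leaves only two behind, so that rule may need a similar guard.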

Mike Lischke

Mar 28, 2018, 10:30:28 AM
to antlr-di...@googlegroups.com
Hi Michael,


But:

  """
  Additional double quotes before the closing token FAIL!
  """""

No wonder, you haven't covered this case in your grammar. Once the lexer has returned to the default mode (after it saw END_TRIPLE_QUOTE), the parser doesn't expect any further input. As the grammar stands, it simply ignores any additional input, because you haven't added EOF to force the parser to consume the full input and fail if there is more than it can handle.

Instead, modify your grammars as follows.

For the lexer, add:

DOUBLE_QUOTE: '""';

in the default mode (e.g. before TRIPLE_QUOTE). For the parser, change stringLiteral to:

stringLiteral:
  TRIPLE_QUOTE multiLinePart* END_TRIPLE_QUOTE DOUBLE_QUOTE? EOF
;

Now the two trailing double quotes are accepted and everything is fine.

However, this suggestion is based on a very incomplete language specification. If it doesn't fit your needs you should list exactly what you expect the parser to do.

