Solution: String Literals with Escape Characters (ANTLR 3.4; Java Target)

tille...@gmail.com

Jan 23, 2013, 7:02:28 PM

to antlr-di...@googlegroups.com

Note 1: PLEASE POKE HOLES IN THIS SOLUTION!!!
Note 2: This is not all my work. It's a combination of stuff I found on stackoverflow and other corners of the Internet plus an idea I had to tie it all together.

Problem Statement: Lexer needs to provide string literals of the form "abcdef" PLUS escape sequences for \n, \r, and so on. When an escape sequence does not result in a translation to a "special character," it must return the exact character escaped (e.g. "\q" results in "q").

The best solution I found was from Trevor Robinson, Bruno Ranschaert, and Peter Štibraný (et alia), which used a Java StringBuilder object to "build up" the string literal character-by-character. Here is a link to the stackoverflow discussion: http://stackoverflow.com/questions/504402/how-to-handle-escape-sequences-in-string-literals-in-antlr-3 The solution is pasted below:

STRING
@init { final StringBuilder buf = new StringBuilder(); }
    :
    '"'
    (
        ESCAPE[buf]
        | i = ~( '\\' | '"' ) { buf.appendCodePoint(i); }
    )*
    '"'
    { setText(buf.toString()); };

fragment ESCAPE[StringBuilder buf] :
    '\\'
    ( 't' { buf.append('\t'); }
    | 'n' { buf.append('\n'); }
    | 'r' { buf.append('\r'); }
    | '"' { buf.append('\"'); }
    | '\\' { buf.append('\\'); }
    );

Their (collective) solution worked, except in one case: an escaped character that isn't in the list should simply resolve to that character. For example, "\z" should append the letter "z" to the StringBuilder. Unless the escaped character is listed in the "fragment ESCAPE" lexer rule, it won't work (it throws a mismatched token exception, if I recall correctly).

You can't add another alternative to the lexer rule to match any character because it conflicts with the other alts. And even if you could, the lexer built-in function "matchAny()" returns VOID. So adding an alt like:

anychar=(.) { buf.append($anychar.text); }

won't work.

Here's how I worked around it:
1. I changed the "fragment ESCAPE" lexer rule to match backslash plus a single "any" character:

fragment ESCAPE[StringBuilder buf]
: '\\' .
{ processEscapeSeq(getText(), $buf); }
;

2. I added a "helper" function for the lexer in the "lexer::@members" section. It operates by examining the last character of the token text, and taking the same action as the lexer rule would have. Except that if it doesn't recognize the character it just adds it in as-is. Like so:

@lexer::members {
    private void processEscapeSeq(String in_text, StringBuilder in_sb) {
        if(in_text == null)
            return;
        if(in_text.length() < 2)
            return;
        char c = in_text.charAt(in_text.length()-1);
        switch(c) {
            case 'n' : in_sb.append("\n"); break;
            case 'r' : in_sb.append("\r"); break;
            case 't' : in_sb.append("\t"); break;
            case 'b' : in_sb.append("\b"); break;
            case 'f' : in_sb.append("\f"); break;
            case '"' : in_sb.append("\""); break;
            case '\'' : in_sb.append("'"); break;
            case '/' : in_sb.append("/"); break;
            case '\\' : in_sb.append("\\"); break;
            default : in_sb.append(c);
        }
    }
}

I don't need to specify unicode literals (\u escape plus hex digits), but if I did I think the solution could be tweaked to handle it. Obviously, I could have done this with code in the lexer rule, but it seemed cleaner to have the helper function.
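For anyone who does need \u escapes, here is a minimal sketch of how the same idea might be extended. It is plain Java that decodes the whole string body in one pass rather than one escape at a time. The class name UnicodeEscapes and the method unescape are hypothetical, and the sketch assumes a \u escape is followed by exactly four hex digits; a malformed \u falls back to the "as-is" rule. The lexer rule would also have to accept the hex digits, which is not shown here.

```java
final class UnicodeEscapes {
    private UnicodeEscapes() {}

    /** Decodes backslash escapes in a string body (outer quotes already removed). */
    static String unescape(String body) {
        StringBuilder sb = new StringBuilder(body.length());
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i++);
            if (c != '\\' || i == body.length()) {
                sb.append(c);            // ordinary character, or a trailing lone backslash
                continue;
            }
            char e = body.charAt(i++);   // the character after the backslash
            switch (e) {
                case 'n': sb.append('\n'); break;
                case 'r': sb.append('\r'); break;
                case 't': sb.append('\t'); break;
                case 'b': sb.append('\b'); break;
                case 'f': sb.append('\f'); break;
                case 'u':
                    if (isFourHexDigits(body, i)) {
                        // Assumption: exactly four hex digits form one UTF-16 code unit.
                        sb.append((char) Integer.parseInt(body.substring(i, i + 4), 16));
                        i += 4;
                    } else {
                        sb.append('u');  // malformed \u: treat like any unknown escape
                    }
                    break;
                default: sb.append(e);   // unknown escape resolves to the character itself
            }
        }
        return sb.toString();
    }

    private static boolean isFourHexDigits(String s, int from) {
        if (from + 4 > s.length()) return false;
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) return false;
        }
        return true;
    }
}
```

With this sketch, the body a\u0041b would decode to aAb, and \q would still decode to q.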

Jim Idle

Jan 23, 2013, 10:17:50 PM

to antlr-di...@googlegroups.com

The best solution is:

STRING: '"' ~('\n'|'"')* ('"' | { issue unterminated literal error } ) ;

Do not try to coax the lexer into being a semantic verifier. Let it eat everything that looks vaguely OK, then move checking and errors as far down the tool chain as possible. Then you can issue:

Warning: (Line 4, offset 55) : The escape sequence \k has no meaning.

Instead of:

Lexer: unexpected: 'k'

You can do this because you can scan the text of the STRING token at any point in the tool chain - no need to stop parsing because of a mistake like this on behalf of your users.
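A rough sketch of what such a later scan might look like in Java. The class name EscapeChecker, the set of recognized escapes, and the warning format are assumptions made for illustration; the offset reported here is relative to the token text, so a real tool would add the token's own line and column.

```java
import java.util.ArrayList;
import java.util.List;

final class EscapeChecker {
    // Assumption: these are the escapes the language gives a meaning to.
    private static final String KNOWN = "nrtbf\"'\\/";

    private EscapeChecker() {}

    /** Scans the text of a STRING token and collects a warning per unknown escape. */
    static List<String> check(String tokenText) {
        List<String> warnings = new ArrayList<>();
        for (int i = 0; i + 1 < tokenText.length(); i++) {
            if (tokenText.charAt(i) != '\\') continue;
            char escaped = tokenText.charAt(++i);   // also skips the escaped character
            if (KNOWN.indexOf(escaped) < 0) {
                warnings.add("Warning: (offset " + (i - 1) + ") : The escape sequence \\"
                        + escaped + " has no meaning.");
            }
        }
        return warnings;
    }
}
```

Because the scan only collects warnings, parsing can continue and the caller decides whether an unknown escape is fatal.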

Jim

--

Mike Lischke

Jan 24, 2013, 3:11:19 AM

to antlr-di...@googlegroups.com

In addition to that, I have a separate alt for the escape char and count its occurrences. This number is then stored in the token and is used in the parser owner to see if post-processing is necessary when retrieving the string value of that token. In addition, I let the lexer remove the outer quote chars, which then looks like this (C target):

DOUBLE_QUOTED_TEXT
@init
{
    int escape_count = 0;
    int actual_start = $start;
}:
    DOUBLE_QUOTE { actual_start = GETCHARINDEX(); }
    (
        DOUBLE_QUOTE DOUBLE_QUOTE { escape_count++; }
        | ESCAPE_OPERATOR . { escape_count++; }
        | ~(DOUBLE_QUOTE | ESCAPE_OPERATOR)
    )*
    { $start = actual_start; EMIT(); LTOKEN->user1 = escape_count; }
    DOUBLE_QUOTE
;
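
On the consuming side, the stored count might be used roughly as follows. This is a sketch in Java (the target used earlier in the thread) rather than C. The class name QuotedText and the method valueOf are hypothetical, the count is passed as a plain int instead of being read from the token, and the escape mapping is deliberately small. It assumes the token text already has its outer quotes removed, as the rule above arranges.

```java
final class QuotedText {
    private QuotedText() {}

    /** Returns the string value of a token whose outer quotes were already stripped. */
    static String valueOf(String text, int escapeCount) {
        if (escapeCount == 0) {
            return text;                 // fast path: the lexer saw nothing to rewrite
        }
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean hasNext = i + 1 < text.length();
            if (c == '"' && hasNext && text.charAt(i + 1) == '"') {
                sb.append('"');          // a doubled quote stands for one quote
                i++;
            } else if (c == '\\' && hasNext) {
                char e = text.charAt(++i);
                // Assumption: only n, t and r are special; anything else is kept as-is.
                sb.append(e == 'n' ? '\n' : e == 't' ? '\t' : e == 'r' ? '\r' : e);
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

The point of the count is the first branch: most strings contain no escapes, so they are returned without being copied or scanned.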

--

Mike

--

Mike Lischke, Principal Software Engineer

MySQL Developer Tools

Oracle Corporation, www.oracle.com

tille...@gmail.com

Jan 24, 2013, 12:12:48 PM

to antlr-di...@googlegroups.com, mike.l...@oracle.com

>The best solution is:
>
>STRING: '"' ~('\n'|'"')* ('"' | { issue unterminated literal error } ) ;
>
>Do not try to coax the lexer into being a semantic verifier. Let it eat everything that looks vaguely OK, then move checking and errors as far down the tool chain as possible. Then you can issue:
>
>Warning: (Line 4, offset 55) : The escape sequence \k has no meaning.
>
>Instead of:
>
>Lexer: unexpected: 'k'
>
>You can do this because you can scan the text of the STRING token at any point in the tool chain - no need to stop parsing because of a mistake like this on behalf of your users.

>Jim

Jim,

I think I would prefer to have a "clean" string literal as early in the tool chain as possible so that none of the later points have to know that the string had ever needed scanning. That allows me to make a distinction between strings that needed escape processing because the user intended them to be that way (e.g. read in from a file that has no actual newlines but indicates their location with backslash-n) versus string literals in the code. Instead of every parser rule that takes STRING_LITERAL as a token needing to de-escape the string, they all are handed strings as neatly pre-processed tokens. Which I think is what the lexer ought to be doing.

I see your point, though. There ARE circumstances where you want to leave the decision to de-escape or not de-escape to a later point in the tool chain. So I added single-quoted string literals to my grammar, and those have no de-escaping performed on them. Similar to the SQL standard or VB string literals, a single-quoted string literal begins and ends with the "tick" symbol (') and uses a sequence of two ticks in a row to indicate a tick symbol WITHIN the literal. Unix shells do similar things. A single-quoted string does not get $ expansion or escape processing, whereas a double-quoted string does. My grammar calls RegEx functions a lot, so not having to use four backslashes to match a single backslash character is a real help.
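
To make the single-quote rule concrete, here is a small Java sketch of turning such a token's text into its value: strip the outer ticks, collapse each pair of ticks into one, and leave backslashes alone. The class name SingleQuoted and the method value are hypothetical, and the sketch assumes the token text still includes both outer ticks.

```java
final class SingleQuoted {
    private SingleQuoted() {}

    /** Converts the raw text of a single-quoted literal into its string value. */
    static String value(String tokenText) {
        if (tokenText.length() < 2
                || tokenText.charAt(0) != '\''
                || tokenText.charAt(tokenText.length() - 1) != '\'') {
            throw new IllegalArgumentException("not a single-quoted literal: " + tokenText);
        }
        String body = tokenText.substring(1, tokenText.length() - 1);
        // Two ticks in a row mean one tick; backslashes are not touched.
        return body.replace("''", "'");
    }
}
```

For example, the literal 'it''s' would yield it's, and a regex such as '\d+' would come through with its backslash intact.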

The one takeaway I get from these discussion boards is that if you ask ten ANTLR developers a question, you'll get ten different answers, and none of them will agree with the answer you would have given them if they asked you the same question. Whoever said "great minds think alike" was an optimist!

Jim Idle

Jan 25, 2013, 12:02:52 AM

to antlr-di...@googlegroups.com, mike.l...@oracle.com

Of course each specific use case has unique requirements, which is why all generic statements are generally useless (he says, in true Bertrand Russell style). ;)