String interpolation, and the lexer

286 views
Skip to first unread message

Saar Korren

unread,
Feb 1, 2014, 9:13:44 AM2/1/14
to haxe...@googlegroups.com
First of all, a disclaimer: I'm no fan of string interpolation as a language feature, and really think it should be left to libraries. It makes the syntax unwieldy, as the remainder of the post will make evident. Nevertheless, I'm going to avoid ranting about it, and provide actual constructive comments for working with it.

The current stable release of HaXe parses interpolated strings by first parsing them as regular strings, and then running a second parsing phase on the string contents. This causes code like this to cause a syntax error:
trace('${'syntax error'}');
while code like this works fine:
trace('${"Works fine"}'); // Equivalent to trace("Works fine")

4 months ago, the lexer received a patch to "correct" this. The patch simply tries to detect the use of strings inside the "${}" block, and correctly "skip" them. This even correctly parses:
trace('${'par${'ses'} okay'}');

The problem is that it fails if curly braces appear outside of another string or comment:
trace('${{"This should";}+' work'}');
It may seem like odd syntax, but it shouldn't be invalid:
trace({"This does";}+' work');

This could probably be fixed by changing this:
    | '{' | '/' { store lexbuf; code_string lexbuf }
to this:
    | '{' { store lexbuf; code_string lexbuf; code_string lexbuf }
   
| '/' { store lexbuf; code_string lexbuf }
However, I'm not fluent enough in OCaml to know if that's correct, or utter BS. Either way, I don't even think it's the correct way to fix things.

The issue here is that string interpolation represents a language-in-a-language construct which forces running something that is parser-like within the lexer context. It forces the lexer to be partially stateful (equivalent to a stack-based FSM), making classical stateless lexers (equivalent to a stackless FSM) unusable.
If you do want to support it, however, then I think the biggest issue is the assumption that the entire formatted string should be output as a single lexeme. While I don't fully understand how the lexer produces its output in the OCaml code, I can tell that keywords, identifiers, and other syntax constructs are not processed inside the interpolated string.

In other words, this code:
trace('The sum of $a and $b is ${a+b} and ${if (a>b) a else b} is bigger');
Produces the lexical string:
IDENT("trace") O_PAREN("(") INTERP_STRING("The sum of $a and $b is ${a+b} and ${if (a>b) a else b} is bigger") C_PAREN(")") SEMICOLON(";")
(Actually, I wish it would output INTERP_STRING. Instead, it does something far uglier, which could break if any of the input files is larger than 1mb)

This then forces a second lexer to be run on each of the code blocks inside the interpolated string, followed by another parser run for each.

IMO, lexing should only be done once, and have the output:
IDENT("trace") O_PAREN("(") STRING("The sum of ") INTERP_IDENT("a") CONTINUE_STRING(" and ") INTERP_IDENT("b") CONTINUE_STRING(" is ") INTERP_START("${") IDENT("a") OP_ADD("+") IDENT("b") INTERP_END("}") CONTINUE_STRING(" and ") INTERP_START("${") KEYWORD_IF("if") WHITESPACE(" ") O_PAREN("(") IDENT("a") OP_GT(">") IDENT("b") C_PAREN(")") WHITESPACE(" ") IDENT("a") WHITESPACE(" ") KEYWORD_ELSE("else") WHITESPACE(" ") IDENT("b") INTERP_END("}") CONTINUE_STRING(" is bigger") C_PAREN(")") SEMICOLON(";")
(It's more verbose, but that's sort of the point of a lexer)

This would require the lexer to maintain state between returned lexemes, so it can properly parse a CONTINUE_STRING, or tell apart an INTERP_END("}") from a C_BRACE("}"). An interpolation/brace stack would have to be kept. I'm not entirely certain if this can be done in OCaml, but it can be done in other languages for hand-written lexers.
This would be trivial for a lexer written as a coroutine which yields lexemes, rather than one written as a function which returns them.

The up-side of doing things this way is that it makes things much easier for the parser. Instead of having to run a second lexer and parser on interpolated strings, a parser, which is already built for tree-like syntax, can simply run a rule like (excuse my BNF)
<interp_string> ::=
  STRING
<interp_continue>
<interp_continue> ::= <inter_part> | <interp_part> <interp_continue>
<interp_part> ::=
  CONTINUE_STRING |
  INTERP_IDENT |
  INTERP_START
<expression> INTERP_END

Nicolas Cannasse

unread,
Feb 2, 2014, 4:36:16 AM2/2/14
to haxe...@googlegroups.com
Le 01/02/2014 15:13, Saar Korren a écrit :
> First of all, a disclaimer: I'm no fan of string interpolation as a
> language feature, and really think it should be left to libraries. It
> makes the syntax unwieldy, as the remainder of the post will make
> evident. Nevertheless, I'm going to avoid ranting about it, and provide
> actual constructive comments for working with it.

If you wish to contribute, please send a pull request on GitHub with
corresponding unit tests.

Best,
Nicolas

Saar Korren

unread,
Feb 2, 2014, 5:36:32 AM2/2/14
to haxe...@googlegroups.com
You're assuming I can code in OCaml (I'm fairly sure I mentioned my "fix" was most likely BS). I did the next best things and posted an issue here:
https://github.com/HaxeFoundation/haxe/issues/2595

I can also post the solution I used in my own lexer, but I doubt it would help since it's written in Java.
Reply all
Reply to author
Forward
0 new messages