
Design question on PPC 0019 "quoted template strings" - To Sublex or not to Sublex


Paul "LeoNerd" Evans

Jan 11, 2024, 10:45:04 AM
to Perl5 Porters
I'm looking at getting around to implementing PPC 0019 finally.

https://github.com/Perl/PPCs/blob/main/ppcs/ppc0019-qt-string.md

First interesting question: Should qt() strings be sub-lexed, or not..?

To explain this question, I'll first need to draw attention to an
annoying quirk of how existing strings like q() and qq() work.

When the lexer encounters a quote-start operator like q or qq, the
first thing it does is look at what the delimiting characters are, and
then it scans ahead looking for the end marker. While looking, it knows
how to count handed pairs *of that marker* and ignore escaped versions,
but it doesn't know anything else. Once it has found the bounds of that
string quoting form, it goes off into a separate parse phase to
understand the inner contents of it, which then get inserted at the
parse point.

q(this is the contents) and now we are outside

q(we can count (inner) parentheses) and now this is outside

q(we ignore \( escaped parens) and now this is outside

but that's as far as it goes. Note that it *does not* understand perl
code inside qq() strings.
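
To make the "find the end first" step concrete, here is a rough sketch of that scan in plain Perl. It is only a model of the behaviour described above, not the real lexer: scan_delimited is a hypothetical helper, it assumes one of the four bracketing delimiter pairs, and it knows nothing about the contents beyond counting pairs and skipping backslash escapes.

```perl
use strict;
use warnings;

# Hypothetical helper (not perl's actual lexer): given text that starts at
# the opening delimiter, return the raw body and whatever follows the close.
my %close_for = ('(' => ')', '[' => ']', '{' => '}', '<' => '>');

sub scan_delimited {
    my ($src) = @_;
    my $open  = substr($src, 0, 1);
    my $close = $close_for{$open} or die "not a bracketing delimiter: $open";
    my $depth = 1;
    my $pos   = 1;
    while ($pos < length $src) {
        my $ch = substr($src, $pos, 1);
        if ($ch eq '\\') {
            $pos += 2;    # skip the backslash and the character it escapes
            next;
        }
        $depth++ if $ch eq $open;
        $depth-- if $ch eq $close;
        if ($depth == 0) {
            return (substr($src, 1, $pos - 1), substr($src, $pos + 1));
        }
        $pos++;
    }
    die "Can't find string terminator \"$close\"";
}

for my $src (
    q|(we can count (inner) parentheses) and now this is outside|,
    q|(This ${\ somefunc ')' } is not valid)|,
) {
    my ($body, $rest) = scan_delimited($src);
    print "body: $body\nrest: $rest\n\n";
}
```

The second input shows the problem: the scan has no idea that the ) sits inside a quoted Perl string, so it ends the body there.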

eval: qq(This ${\ somefunc ')' } is not valid)
Compile error: Can't find string terminator "'" anywhere before EOF at
(eval 7) line 1.

What went wrong here?

Remember - the lexer first looks at the quoting marker, and then tries
to find the end. It found the end.

qq(This ${\ somefunc ')

   ################## ^-- Oh look here's the end.

That inside then gets passed into a sub-lexer to parse, and then gets
inserted back into the original syntax

qq(###################)' } is not valid)

Oops. Well, that definitely doesn't look like valid perl code - offhand
I don't know if the parse error comes from the sub-lex inside or the
main parse outside, but either way, it failed.


So with that in mind - what do we feel about the new qt() string syntax?

I.e. what do people feel -should- be the behaviour of a construction
like

sub f { ... }

say qt(Is this { f(")") } valid syntax?);

Should it:

1) Yield a parse error similar to the ones given in the example above?

2) Parse as valid perl code yielding a similar result to:

say 'Is this ', f(")"), ' valid syntax?';

3) Something else?


I feel that interpretation 2 might be most useful and powerful, but
would be inconsistent with existing behaviour of existing operators.
Interpretation 1 is certainly easier to achieve as it reüses existing
parser structures, but given the whole point is to interpolate code
inside the {braces} it might lead to weird annoying cases that don't
work so well.

Does anyone have any good examples one way or other from other
languages that have a similar construction?

(Cross-posted to https://github.com/Perl/PPCs/issues/47)

--
Paul "LeoNerd" Evans

leo...@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/

Michael Conrad

Jan 11, 2024, 1:30:06 PM
to perl5-...@perl.org
On 1/11/24 10:30, Paul "LeoNerd" Evans wrote:
> Should it:
>
>   1) Yield a parse error similar to the ones given in the example above?
>
>   2) Parse as valid perl code yielding a similar result to:
>
>        say 'Is this ', f(")"), ' valid syntax?';
>
>   3) Something else?
>
>
> I feel that interpretation 2 might be most useful and powerful, but
> would be inconsistent with existing behaviour of existing operators.

I would also agree that #2 is better for users of perl, but would be a significant burden to the implementors of syntax highlighting.  Currently those syntax highlighters get to take advantage of the same easy parsing.  If you force them to dive into a full perl parse they might have to re-structure their entire code to be able to recursively call into it.  You would end up in the short term with most editors not bothering to fix that, and then having misleading syntax highlighting which could confuse users worse than option #1 would have.

On the general topic of string interpolations, I did some recent exploration into this for CodeGen::Cpppp and decided that the nicest extension of string interpolation would be to make "${{ }}" parse as a code block.  It's more characters, but reads quite nicely and doesn't get in the way of code generation.  qt// would be fairly horrible for code generation if every { needs to be escaped.  Also, "${{ }}" could be added to the regular interpolation contexts and not break back-compat since it would have been an error, before.

Compared to the ppc 19 examples:

  # Simple scalar interpolation
  qt<Greetings, {$title} {$name}>;
  
  # Interpolation of method calls
  qt"Greetings, {$user->title} {$user->name}";
  
  # Interpolation of various expressions
  qt{It has been {$since{n}} {$since{units}} since your last login};
  
  qt{...a game of {join q{$"}, $favorites->{game}->name_words->@*}};

You would get

  # Simple scalar interpolation
  "Greetings, $title $name";
  
  # Interpolation of method calls
  "Greetings, ${{$user->title}} ${{$user->name}}";
  
  # Interpolation of various expressions
  "It has been $since{n} $since{units} since your last login";
  
  "...a game of ${{ $favorites->{game}->name_words->@* }}";

Michael Conrad

Jan 11, 2024, 1:45:05 PM
to perl5-...@perl.org
Oops.

On 1/11/24 13:26, Michael Conrad wrote:
> Also, "${{ }}" could be added to the regular interpolation contexts
> and not break back-compat since it would have been an error, before.

Actually existing syntax allows

   say "${{ a => \1 }->{a}}"

so my suggestion would indeed break back-compat.

Also my final example could have just been @{ }.   The ppc0019 examples
aren't really showing why qt would be an advantage.  Better
examples would be to show infix operators and things like that.

Michael Conrad

Jan 11, 2024, 1:45:05 PM
to perl5-...@perl.org
On 1/11/24 13:35, Lukas Mai wrote:
> On 11.01.24 19:26, Michael Conrad wrote:
>> I would also agree that #2 is better for users of perl, but would be
>> a significant burden to the implementors of syntax highlighting. 
>> Currently those syntax highlighters get to take advantage of the same
>> easy parsing.  If you force them to dive into a full perl parse they
>> might have to re-structure their entire code to be able to
>> recursively call into it.  You would end up in the short term with
>> most editors not bothering to fix that, and then having misleading
>> syntax highlighting which could confuse users worse than option #1
>> would have.
>
> But you already can interpolate arbitrary code into strings:
>
>     "@{['just', 'an', 'example', 2+2]}"
>
> If you want to highlight that sensibly, you already need some sort of
> recursive embedding.
>
> (Also, JavaScript does #2 with its `... ${ ... } ...` construct and I
> don't hear developer tools complaining about that.)

Are you aware of highlighters that parse those inner expressions?  The
ones I've seen just highlight the whole string the same color and don't
bother.

Paul "LeoNerd" Evans

Jan 11, 2024, 2:00:05 PM
to Michael Conrad, perl5-...@perl.org
On Thu, 11 Jan 2024 13:26:12 -0500
Michael Conrad <mi...@nrdvana.net> wrote:

> I would also agree that #2 is better for users of perl, but would be
> a significant burden to the implementors of syntax highlighting. 
> Currently those syntax highlighters get to take advantage of the same
> easy parsing.  If you force them to dive into a full perl parse they
> might have to re-structure their entire code to be able to
> recursively call into it.  You would end up in the short term with
> most editors not bothering to fix that, and then having misleading
> syntax highlighting which could confuse users worse than option #1
> would have.

Having actually written a Perl syntax highlighter for multiple
use-cases including text editors [1], I can say that actually option #2
is _easier_. Trying to do that "find the end then sublex" is a harder
more complex structure to express in most grammar engines, than the
more regular structure of a recursive grammar.

[1]: https://github.com/tree-sitter-perl/tree-sitter-perl/

Michael Conrad

Jan 11, 2024, 2:30:05 PM
to perl5-...@perl.org, leo...@leonerd.org.uk
On 1/11/24 13:48, Paul "LeoNerd" Evans wrote:
> On Thu, 11 Jan 2024 13:26:12 -0500
> Michael Conrad <mi...@nrdvana.net> wrote:
>
>> I would also agree that #2 is better for users of perl, but would be
>> a significant burden to the implementors of syntax highlighting.
>> Currently those syntax highlighters get to take advantage of the same
>> easy parsing.  If you force them to dive into a full perl parse they
>> might have to re-structure their entire code to be able to
>> recursively call into it.  You would end up in the short term with
>> most editors not bothering to fix that, and then having misleading
>> syntax highlighting which could confuse users worse than option #1
>> would have.
> Having actually written a Perl syntax highlighter for multiple
> use-cases including text editors [1], I can say that actually option #2
> is _easier_. Trying to do that "find the end then sublex" is a harder
> more complex structure to express in most grammar engines, than the
> more regular structure of a recursive grammar.
>
> [1]: https://github.com/tree-sitter-perl/tree-sitter-perl/
>
Well, I stand corrected.  Actually Vim and highlight.js already think that

   "$test ${ '"' }"

is a valid string, while Scintilla (Scite, Notepad++) and VSCode
correctly detect the end of the string at the internal "

So vim and highlight.js wouldn't require any effort to handle qt{} with
embedded parsing, but scintilla and vscode might.


Philippe Bruhat (BooK)

Jan 11, 2024, 9:15:05 PM
to Perl5 Porters
On Thu, Jan 11, 2024 at 03:30:50PM +0000, Paul "LeoNerd" Evans wrote:
> I'm looking at getting around to implementing PPC 0019 finally.
>
> https://github.com/Perl/PPCs/blob/main/ppcs/ppc0019-qt-string.md
>
> First interesting question: Should qt() strings be sub-lexed,
> or not..?
>
> I.e. what do people feel -should- be the behaviour of a construction
> like
>
> sub f { ... }
>
> say qt(Is this { f(")") } valid syntax?);
>
> Should it:
>
> 1) Yield a parse error similar to the ones given in the example
> above?
>
> 2) Parse as valid perl code yielding a similar result to:
>
> say 'Is this ', f(")"), ' valid syntax?';
>
> 3) Something else?
>
>
> I feel that interpretation 2 might be most useful and powerful, but
> would be inconsistent with existing behaviour of existing operators.
> Interpretation 1 is certainly easier to achieve as it reüses existing
> parser structures, but given the whole point is to interpolate code
> inside the {braces} it might lead to weird annoying cases that don't
> work so well.
>

I definitely prefer 1) because I think the wording of the PPC implies
that qt-strings are parsed just like the other quoted constructs (look
for the end, skip escaped delimiters). It's also consistent with how
parsing of quoted constructs has been documented for the past 25+ years
(75e14d17912ce8a35d5c2b04c0c6e30b903ab97f in June 1998).

I'm so used to using ${\} and @{[]} in double-quoted strings that I
didn't really see the benefits of qt right away. It clearly declutters
complex expressions, though.
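
For readers less used to those two idioms, a minimal sketch of what they look like in today's double-quoted strings (the %user data is made up for illustration):

```perl
use strict;
use warnings;

my %user = (title => 'Dr', name => 'Ada');    # hypothetical sample data

# ${\ EXPR } takes a reference to the result of EXPR and dereferences it
# again, which lets an arbitrary scalar expression sit inside the string.
my $greeting = "Greetings, ${\ join(' ', $user{title}, $user{name}) }";

# @{[ LIST ]} builds an anonymous array and interpolates its elements,
# joined by $" (a single space by default).
my $sum = "Total: @{[ 2 + 2 ]} items";

print "$greeting\n$sum\n";
```

Both work, but the punctuation around the expression is exactly the clutter that qt's plain {braces} would avoid.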

Thinking about how qt should be parsed, I think it should be consistent
with this paragraph from perldoc ("Gory details of parsing quoted
constructs"):

When searching for single-character delimiters, escaped delimiters
and "\\" are skipped. For example, while searching for terminating
"/", combinations of "\\" and "\/" are skipped. If the delimiters
are bracketing, nested pairs are also skipped. For example, while
searching for a closing "]" paired with the opening "[", combinations
of "\\", "\]", and "\[" are all skipped, and nested "[" and "]" are
skipped as well. However, when backslashes are used as the delimiters
(like "qq\\" and "tr\\\"), nothing is skipped. During the search for
the end, backslashes that escape delimiters or other backslashes are
removed (exactly speaking, they are not copied to the safe location).

and also that anything outside of {} should be treated as single-quoted
strings, meaning that all these produce the same result (the literal
`$bloop`):

'$bloop'
qt($bloop)
qt({'$bloop'})
qt({('$bloop'\)})

and like q(), the only character that needs to be escaped is the
delimiter.

We are quite used to, as Perl users, picking our delimiters, and being
careful about embedding them inside our single or double-quoted
strings. Escaping or balancing our delimiters is something we already
do commonly.

Expanding on your example by adding a newline to the string literal in
the embedded code, I think this would be valid syntax:

sub f { shift }
say qt(Is this { f("\)\n") } valid syntax?);

and would print

Is this )
valid syntax?

i.e. the code run in the template is `f(")\n")`.

Now the documentation in perldoc actually says that \\ is also skipped,
meaning you'd have to escape all \, which sounds less than ideal when
embedding code. (And my example above contradicts this.)

If you consider the current behaviour around \ in single-quoted strings,
it is already a bit confusing:

$ perl -E 'say for q(\1), q(\\2), q(\\\3), q(\\\\4)'
\1
\2
\\3
\\4

My take on escaping in qt-strings would be that when searching for
single-character delimiters, *only* the escaped delimiters are skipped.
Lone backslashes would be left alone.

So we'd have:

$ perl -E 'say for qt(\1), qt(\\2), qt(\\\3), qt(\\\\4)'
\1
\\2
\\\3
\\\\4

and

$ perl -E 'say for qt(\)), qt(\\)), qt(\\\)), qt(\\\\))'
)
\)
\\)
\\\)

Which I guess is actually your solution 3).
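
As a rough model of that rule (qt does not exist yet, so this is only a sketch): unescape_delims is a hypothetical helper that takes the raw body of a parenthesis-delimited string and drops a backslash only when it directly precedes a delimiter. The inputs come from a non-interpolating heredoc so that the backslashes are exactly as written.

```perl
use strict;
use warnings;

# Hypothetical helper modelling the proposed rule for a ()-delimited body:
# remove a backslash only when the next character is a delimiter, and keep
# every other backslash untouched.
sub unescape_delims {
    my ($raw) = @_;
    (my $out = $raw) =~ s/\\([()])/$1/g;
    return $out;
}

# <<'END' does no escape processing, so each line is a raw body as typed.
my @raw = split /\n/, <<'END';
\1
\\2
\)
\\)
\\\)
END

printf "%-5s => %s\n", $_, unescape_delims($_) for @raw;
```

This reproduces the two tables above for the inputs it covers. It says nothing about how the end of the string is found in the first place.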

It should be possible to describe qt-string by saying that:

qt( prefix { ... } suffix )

runs exactly like this:

' prefix ' . do { ... } . ' suffix '

(and dies with "Unimplemented" in this case)

Actually, in the corner case where there's neither prefix nor suffix
(`qt({...})`), the `do` could be in list context and propagate it. So
it's really `scalar do {...}`. The PPC already says the code is run as a
scalar expression.
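
A small runnable stand-in for that equivalence, written with today's syntax. The body of f and the @words array are made up for illustration.

```perl
use strict;
use warnings;

sub f { return "[$_[0]]" }    # hypothetical function, for illustration only

# Stand-in for the proposed:  qt(Is this { f(")") } valid syntax?)
my $str = 'Is this ' . scalar(do { f(")") }) . ' valid syntax?';
print "$str\n";

# With neither prefix nor suffix, scalar() is what keeps the block from
# seeing list context: an array in the block yields its element count.
my @words = qw(a b c);
my $count = scalar(do { @words });
print "$count\n";
```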

--
Philippe Bruhat (BooK)

Beauty may be a curse, but not as great a curse as stupidity.
(Moral from Groo The Wanderer #11 (Epic))

Paul "LeoNerd" Evans

Jan 12, 2024, 7:30:05 AM
to Smylers  \ via perl5-porters, Smylers
On Fri, 12 Jan 2024 11:32:05 +0100 (CET)
"Smylers  " via perl5-porters <perl5-...@perl.org> wrote:

> One stupid question that I couldn't see the answer to in your email or
> the PPC: what does the t stand for? It isn't an obvious mnemonic for
> ‘expression’ or ‘code block’ or ‘braces’.

I don't quite recall where it came from but I have a vague memory it
might be "quoted template".

> Raku evaluates {...} blocks in double-quoted strings, and goes for
> interpolation interpretation 2 — this works:
>
> say qq[Right square bracket is Unicode {ord(']')}.];
>
> Try it out: https://glot.io/snippets/gsdx63bezo
>
> I think that would be the behaviour of least surprise for users:
>
> • Somebody who hasn't thought about the issue would just write code
> like the above not noticing they have a potentially problematic
> embedded closing delimiter.
>
> • Somebody who realizes the potential issue would avoid it by picking
> a different delimiter — which would still work fine.

Yes, it sounds from the descriptions that Ruby and JavaScript take
similar rules there, so we'd be in good company.

> Michael Conrad writes:
>
> > I would also agree that #2 is better for users of perl, but would
> > be a significant burden to the implementors of syntax highlighting.
> >
>
> I don't feel that should massively be taken into account. Syntax-
> highlighting Perl is already awkward, and if I'm writing something
> particularly esoterically nested, I don't necessarily expect syntax
> highlighters to get it right.

Plus I wouldn't be surprised if half of the existing highlighters fail
to correctly implement the rules on existing q/qq/etc.. anyway. ;) It
took us quite some effort in tree-sitter-perl to get the correct set of
behaviours, and that's from knowing a lot of the cornercase traps.


> Philippe Bruhat (BooK) writes:
>
> > I definitely prefer 1) because I think the wording of the PPC
> > implies that qt-strings are parsed just like the other quoted
> > constructs (look for the end, skip escaped delimiters). It's also
> > consistent with how parsing of quoted constructs has been
> > documented for the past 25+ years
> > (75e14d17912ce8a35d5c2b04c0c6e30b903ab97f in June 1998).
>
> Conversely, when adding a new feature, it's good to remedy any
> shortcomings in existing features. ‘You can embed a code block as an
> expression’ is simpler both to teach and market as a feature than ‘You
> can embed a code block as an expression, unless it happens to include
> the closing character for the surrounding string, in which case it
> might not work.’.
>
> Also, qt is distinct from existing quoting mechanisms in that it's the
> first one whose whole raison d'être is to interpolate a block with
> beginning and ending markers. As such, users may reasonably have
> expectations of what can be in it that isn't particularly influenced
> by what can be done inside other types of quotes.

Yes I'm inclined to agree. The entire point is to embed more complex
code structures and I think this kind of thing would come up often
enough to make people think about it more, as compared to the case in q/qq
strings where it's very rarely of interest. I think about the only time
I'm ever aware of it is if I try to interpolate elements of a hash that
need quoting around the key names; e.g. this won't work:

say "My user name is $data->{"user-name"}";

Whereas my suggested interpretation 2, would permit

say qt"My user name is {$data->{"user-name"}}";
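
For comparison, a sketch of how the same lookup has to be written today (the $data contents are made up): either avoid the outer delimiter inside the subscript, or pick an outer delimiter that the subscript doesn't use.

```perl
use strict;
use warnings;

my $data = { 'user-name' => 'leonerd' };    # hypothetical sample data

# Single quotes in the subscript don't clash with the "..." delimiter.
my $inline = "My user name is $data->{'user-name'}";

# Or change the outer delimiter; the braces inside are balanced, so the
# end-of-string scan still finds the right closing brace.
my $other = qq{My user name is $data->{"user-name"}};

print "$inline\n$other\n";
```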

> > Thinking about how qt should be parsed, I think ... that anything
> > outside of {} should be treated as single-quoted strings
>
> I think that is indeed the intent of PPC 0019, which says in the
> Rationale section: “The proposed qt operator only looks for one
> special token in a string literal: {.”
>
> However, I see the equivalence in the Specification section implies
> that the constant text A and B would be subject to qq interpretation.
> That contradiction in the PPC should probably be resolved before
> implementation starts.

Oooh, yes - another fine question that I hadn't spotted first time
around.

> (Technically it's only a contradiction if you read A and B as being
> placeholders for arbitrary strings. If you read them as simply being
> the exact strings 'A ' and ' B' then whether they are treated as q or
> qq strings is irrelevant, since they're the same in both.)
>
> But ... I'm concerned that not having \n interpreted as a line-break
> in qt strings could make them less useful.
...
> It also makes it harder to refactor old code to ‘upgrade’ from qq
> strings to qt strings.
...

Yes, a lot of interesting thoughts there that basically come down to
another set of choices, on how to handle the non-{} parts of the qt
string contents.

I think there are three options here, in order of size:

1) Treat the characters exactly like q()

2) Treat the characters like the \X-aware parts of qq() but without
$... and @... interpolations

3) Treat the characters exactly like qq(), including variable
interpolations

Clearly the PPC doesn't intend for option 3 - it shouldn't support full
variable interpolations like qq(). There aren't any examples in the PPC
to distinguish options 1 or 2, but I think from the intention of the
part you quote where it says "equivalent to" an example using qq()
instead, suggests we probably should still support those escapes.

That is, an example like:

print qt(My name is { $self->name }\n);

would emit an actual newline sequence, rather than a literal
backslash-n combination.
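
To illustrate the difference between options 1 and 2 using only existing operators (qt itself can't be run yet, so the last line is an assumption about how option 2 would behave, written out by hand):

```perl
use strict;
use warnings;

my $literal = q(name\n);     # like option 1: backslash and "n" stay as-is
my $escaped = qq(name\n);    # escapes processed: \n is a single newline

print length($literal), "\n";    # 6 characters
print length($escaped), "\n";    # 5 characters

# Option 2 would process escapes but not variables, so a hypothetical
# qt($name\n) would presumably match this hand-escaped qq string:
my $opt2 = "\$name\n";
print $opt2;
```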

While this does create yet another kind of weird quoting context that
has its own unique rules, that was already the case for qt() the moment
we picked the rules for {}. That brings the count up to at least four
that I can see - q(), qq(), m() and qt().

Philippe Bruhat (BooK)

Jan 12, 2024, 12:45:05 PM
to perl5-...@perl.org
On Fri, Jan 12, 2024 at 12:19:21PM +0000, Paul "LeoNerd" Evans wrote:
> On Fri, 12 Jan 2024 11:32:05 +0100 (CET)
> "Smylers  " via perl5-porters <perl5-...@perl.org> wrote:
>
> > Philippe Bruhat (BooK) writes:
> >
> > > I definitely prefer 1) because I think the wording of the PPC
> > > implies that qt-strings are parsed just like the other quoted
> > > constructs (look for the end, skip escaped delimiters). It's also
> > > consistent with how parsing of quoted constructs has been
> > > documented for the past 25+ years
> > > (75e14d17912ce8a35d5c2b04c0c6e30b903ab97f in June 1998).
> >
> > Conversely, when adding a new feature, it's good to remedy any
> > shortcomings in existing features. ‘You can embed a code block as an
> > expression’ is simpler both to teach and market as a feature than ‘You
> > can embed a code block as an expression, unless it happens to include
> > the closing character for the surrounding string, in which case it
> > might not work.’.

After discussing with Paul during this week's PSC meeting and reading
your email, I agree with you both. 2 is the better solution. There
were a bunch of issues I didn't think through.

The old quoting mechanisms will remain the same (for backwards
compatibility), but they shouldn't hold back a new and better one.

> Yes, a lot of interesting thoughts there that basically come down to
> another set of choices, on how to handle the non-{} parts of the qt
> string contents.
>
> I think there's three options here, in order of size
>
> 1) Treat the characters exactly like q()
>
> 2) Treat the characters like the \X-aware parts of qq() but without
> $... and @... interpolations
>
> 3) Treat the characters exactly like qq(), including variable
> interpolations
>
> Clearly the PPC doesn't intend for option 3 - it shouldn't support full
> variable interpolations like qq(). There aren't any examples in the PPC
> to distinguish options 1 or 2, but I think from the intention of the
> part you quote where it says "equivalent to" an example using qq()
> instead, suggests we probably should still support those escapes.

Yes, supporting existing character escapes (but not interpolation)
outside of the {} is the most useful option.

--
Philippe Bruhat (BooK)

When it is time for voting- / In the West or in the East-
Why must we always settle for- / The man we hate the least?
(Intro poem to Groo The Wanderer #108 (Epic))

Paul "LeoNerd" Evans

Jan 12, 2024, 4:15:04 PM
to Lukas Mai, perl5-...@perl.org
On Fri, 12 Jan 2024 19:37:20 +0100
Lukas Mai <lukasmai...@gmail.com> wrote:

> Also, you might want to exclude \Q, \L, \l, \U, \u, \F, \E from the
> set of supported backslash escapes because they represent dynamic
> string transformations that don't really make sense if variable
> interpolation isn't available (and their parsing rules are a mess,
> which is why I didn't bother in Quote::Code).

Yup, I was already thinking exactly that :)
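
For anyone unfamiliar with those escapes, a short example of what two of them do in an ordinary qq string today. They transform interpolated text at runtime, which is why they make little sense without variable interpolation ($name is just sample data):

```perl
use strict;
use warnings;

my $name = 'a.b';    # hypothetical sample value

# \U...\E uppercases the interpolated text; \Q...\E backslash-quotes
# non-word characters, as quotemeta() would.
my $out = "\U$name\E and \Q$name\E";
print "$out\n";
```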

I'll write up some more notes on the PPC doc and send it as a PR I
think.

Paul "LeoNerd" Evans

Jan 18, 2024, 6:30:06 PM
to perl5-...@perl.org, Lukas Mai, Philippe Bruhat (BooK), Smylers
On Fri, 12 Jan 2024 21:01:02 +0000
"Paul \"LeoNerd\" Evans" <leo...@leonerd.org.uk> wrote:

> I'll write up some more notes on the PPC doc and send it as a PR I
> think.

I've written a PR to clarify the rules about escapes in quoting.
Comments/votes welcome:

https://github.com/Perl/PPCs/pull/48

(/cc all the folks who have commented thusly)