Unicode literal syntax thwarts common use cases for triple-quotes


Jon Clark

Jun 15, 2011, 11:36:42 AM
to scala...@googlegroups.com
I've encountered two use cases that I thought would be easily dealt with by using triple-quotes, but are instead interpreted as containing unicode literals:

1) Generating LaTeX:
println("""\usepackage{geometry}""")

2) Referencing a Windows path:
println("""c:\users""")

Both of these result in "ERROR: error in unicode escape"

While this seems reasonable for the content of a single-quoted string, it doesn't seem like the desired behavior inside a triple-quoted string -- if one wants to use unicode characters there, it seems better to change the source file encoding.


My current (gross-looking) workaround is to use a unicode literal for the backslash:
println("""\u005Cusepackage{geometry}""")


Any thoughts as to why the grammar for triple-quotes is defined this way? Any better workarounds?

Cheers,
Jon

(Reference: This issue was mentioned briefly at http://www.scala-lang.org/node/2249)

Jason Zaugg

Jun 15, 2011, 12:26:11 PM
to scala...@googlegroups.com
On Wed, Jun 15, 2011 at 4:36 PM, Jon Clark <jon.h...@gmail.com> wrote:
> I've encountered two use cases that I thought would be easily dealt with by
> using triple-quotes, but are instead interpreted as containing unicode
> literals:

It's worth noting that IntelliJ highlights the unicode escape a
different color within the string literal. The difference is even more
dangerous if you're trying to access some file: """C:\u1234""", that
doesn't trigger a compiler error.

-jason

Dave

Jun 15, 2011, 12:41:10 PM
to scala-user
Looks like a bug to me.
Another workaround is splitting \ and u
scala> println("""c:\""" + """users""")
c:\users

scala> println("""\""" + """usepackage{geometry}""")
\usepackage{geometry}
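
Another way to keep the two characters apart is to write the text with a stand-in character and swap in the backslash at runtime. This is only a sketch: the object name Backslashed and the default placeholder '|' are made up for illustration, and it assumes the placeholder never occurs in the real text.

```scala
object Backslashed {
  // Code point 92 is the backslash. Building it numerically keeps this file
  // free of any backslash-then-u sequence, whatever the compiler does with them.
  private val Backslash: Char = 92.toChar

  /** Replaces every occurrence of `placeholder` in `template` with a backslash. */
  def apply(template: String, placeholder: Char = '|'): String =
    template.replace(placeholder, Backslash)
}
```

With this, Backslashed("|usepackage{geometry}") and Backslashed("c:|users") give the two strings from the original post. A different placeholder can be passed as the second argument if '|' is needed in the text itself.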

Alex Cruise

Jun 15, 2011, 1:36:49 PM
to Dave, scala-user
On Wed, Jun 15, 2011 at 9:41 AM, Dave <dave.mah...@hotmail.com> wrote:
> Looks like a bug to me.

It's a feature, not a bug, alas.  Ever since the days of yore, the Java *lexer* has looked for \\u[0-9a-fA-F]{4} and done the translation before the *parser* ever gets to see it.  Yet another way Scala is bugwards-compatible.
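
The translation step described here can be imitated in a few lines. What follows is a simplified sketch, not the compiler's actual code: the names UnicodePrePass and expand are made up for illustration, and it ignores the rule that a backslash preceded by an odd number of backslashes does not start an escape. It treats a backslash, one or more u's, and four hex digits as an escape, and fails on anything else that starts with backslash-u.

```scala
object UnicodePrePass {
  // Code point 92 is the backslash; written numerically so that this file
  // itself contains no backslash followed by the letter u.
  private val Backslash: Char = 92.toChar

  private def isHex(c: Char): Boolean = Character.digit(c, 16) >= 0

  /** Expands unicode escapes in `src` with no knowledge of strings or comments. */
  def expand(src: String): String = {
    val out = new StringBuilder
    var i = 0
    while (i < src.length) {
      val c = src.charAt(i)
      if (c == Backslash && i + 1 < src.length && src.charAt(i + 1) == 'u') {
        var j = i + 1
        // Any number of u's may follow the backslash.
        while (j < src.length && src.charAt(j) == 'u') j += 1
        if (j + 4 <= src.length && src.substring(j, j + 4).forall(isHex)) {
          out.append(Integer.parseInt(src.substring(j, j + 4), 16).toChar)
          i = j + 4
        } else {
          throw new IllegalArgumentException("error in unicode escape at offset " + i)
        }
      } else {
        out.append(c)
        i += 1
      }
    }
    out.toString
  }
}
```

Because expand never looks at quotes, a path such as c: backslash users fails in the same way whether it sits in a single-quoted string, a triple-quoted string, or a comment.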


-0xe1a

Jon Clark

Jun 15, 2011, 1:59:15 PM
to scala...@googlegroups.com, Dave
I'm not convinced that Java "compatibility" is a justification. As long as things are compatible at the bytecode level, I'm perfectly happy.

Scala has historically abandoned Java traditions and even its own previous design decisions (even at the expense of invalidating some code) if they seemed unreasonable, though. I don't see any reason that Scala's compiler should be forever bound to the Java lexer should this be deemed a Bad Idea or Undesirable.

Jon Clark

Jun 15, 2011, 2:01:11 PM
to scala...@googlegroups.com, Dave
The explanation of why this came to be is appreciated though. :-)

Tony Morris

Jun 15, 2011, 5:25:10 PM
to scala...@googlegroups.com

On 16/06/11 03:59, Jon Clark wrote:
> Scala has historically abandoned Java traditions and even its own
> previous design decisions

er?

--
Tony Morris
http://tmorris.net/


Rex Kerr

Jun 15, 2011, 5:34:18 PM
to Alex Cruise, scala-user
A good explanation--but that doesn't mean that triple-quoting, which doesn't happen in Java, should behave the same way.  The lexer has to understand triple-quotes, so it could also understand to leave \uxxxx alone just like it leaves \n and so on alone.

  --Rex

Alex Cruise

Jun 15, 2011, 6:50:08 PM
to Rex Kerr, scala-user
On Wed, Jun 15, 2011 at 2:34 PM, Rex Kerr <ich...@gmail.com> wrote:
> A good explanation--but that doesn't mean that triple-quoting, which doesn't happen in Java, should behave the same way.  The lexer has to understand triple-quotes, so it could also understand to leave \uxxxx alone just like it leaves \n and so on alone.

Well argued, and agreed.  Now, as for what's to be done... ;)

-0xe1a

Daniel Sobral

Jun 15, 2011, 7:06:44 PM
to scala...@googlegroups.com
Worse than that! Try this:

val x = 1 // the use of \u000a inside a comment causes problems

As others said, though one may be excused for not realizing to what extent,
unicode expansion is done before anything else. Any place where \uXXXX
appears, it is replaced by the equivalent unicode character.

For example, the following is valid:

val x: String = \u0022A string.\u0022

Personally, I *hate* this. I admit it might have some obscure use,
like C trigraphs, but it is just not worth the trouble, IMHO.

Now, you can forgo unicode escape *completely*, with -Xno-uescape.
Unfortunately, it stops working even inside strings:

scala> val x = "\u000a"
<console>:1: error: invalid escape character
val x = "\u000a"
^

I wish one could go halfway: disable it everywhere *except* inside
single-double quote strings and characters.

--
Daniel C. Sobral

I travel to the future all the time.

Daniel Sobral

Jun 15, 2011, 7:08:54 PM
to Rex Kerr, Alex Cruise, scala-user
Different places, actually. Try this:

val x = 1 // a comment with a \n embedded in it
val y = 2 // a comment with a \u000a embedded in it

It is not that \u is expanded inside strings: it is expanded inside source code.

--

Johannes Rudolph

Jun 16, 2011, 5:59:27 AM
to Daniel Sobral, Rex Kerr, Alex Cruise, scala-user
This may very well increase Scala's credits in the international
obfuscation scene.

Valid Scala:
https://gist.github.com/1028966

Now we just need a program which generates ASCII-art which is at the
same time valid Scala code.

--
Johannes

-----------------------------------------------
Johannes Rudolph
http://virtual-void.net

Philippe Lhoste

Jun 17, 2011, 8:43:13 AM
to scala...@googlegroups.com
On 15/06/2011 17:36, Jon Clark wrote:
> I've encountered two use cases that I thought would be easily dealt with by using
> triple-quotes, but are instead interpreted as containing unicode literals:

Interesting use cases. This is perverse, as one can copy/paste these strings without
noticing the \ u sequence... Decomposition is a valid workaround, but again, you must have
spotted the issue beforehand... It is OK if the compiler complains about a malformed Unicode
literal, but bad if you have the bad luck of having a valid one, like:

val f = """C:\test\uuuFeedUs""" // Not what you think if you are not aware of the issue...

BTW, I still haven't understood why multiple u's are accepted... (from Java tradition).

As said, the problem probably arises because an early phase of the parser just blindly tries
to convert all escapes it sees in the source code, without caring about context. Fast,
but troublesome...
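
Since the silent case is only caught if you already know to look for it, one option is to scan source text for a backslash immediately followed by u before compiling. This is a coarse sketch under stated assumptions: the names BackslashULint and scan are hypothetical, and the check flags every such pair, including ones a compiler would not treat as an escape.

```scala
object BackslashULint {
  // Code point 92 is the backslash; written numerically so that this file
  // itself never contains a backslash followed by the letter u.
  private val Needle: String = 92.toChar.toString + "u"

  /** Returns (1-based line number, 0-based column) for every backslash-u pair. */
  def scan(lines: Seq[String]): Seq[(Int, Int)] =
    for {
      (line, idx) <- lines.zipWithIndex
      col <- Iterator
               .iterate(line.indexOf(Needle))(c => line.indexOf(Needle, c + 1))
               .takeWhile(_ >= 0)
               .toList
    } yield (idx + 1, col)
}
```

Feeding it the lines of a source file gives the positions to review by hand; it makes no attempt to decide whether the hex digits that follow form a valid escape.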

I have never been a fan of these triple quotes anyway (also seen in Python, no?); somehow I
prefer heredoc syntax, with arbitrary end-of-string tag (seen in Perl, PHP...). They
probably come with their own set of issues, but at least they are more flexible.

Hey, at least it is better than Java's rigid syntax... :-)

--
Philippe Lhoste
-- (near) Paris -- France
-- http://Phi.Lho.free.fr
-- -- -- -- -- -- -- -- -- -- -- -- -- --

Jon Clark

Jun 17, 2011, 9:21:46 AM
to scala...@googlegroups.com
Hi all,

There was enough response to this issue that I decided to create an issue in Scala's Jira for this. See https://issues.scala-lang.org/browse/SI-4706. I tried to (briefly) incorporate as much of the discussion here as possible. I used Daniel Sobral's suggestions as the proposed fix for this issue.

Cheers,
Jon

Jon Clark

Jun 17, 2011, 9:33:51 AM
to scala...@googlegroups.com
Also, consider voting for the issue if you want it to catch the attention of the right people: https://issues.scala-lang.org/browse/SI-4706

(Apologies for the rapid-fire posts).

Tomás Lázaro

Jun 20, 2011, 11:49:04 AM
to scala...@googlegroups.com
This doesn't solve the issue at hand but I think it is worth mentioning:

"""c:\user\something"""

could be better written as:

"c:/user/something"

This is a Java thing relating to paths: '/' is accepted as a separator in Windows file paths.

You avoid both the triple quotes as well as the Unicode issue.
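
A minimal sketch of the forward-slash form, using the path from this message and java.io.File. The object name ForwardSlashPath is only for illustration.

```scala
import java.io.File

object ForwardSlashPath {
  // Forward slashes only, so the literal holds no backslash at all.
  val file: File = new File("c:/user/something")

  // getName returns the last component of the path.
  val leaf: String = file.getName
}
```

Here leaf should come out as "something", since java.io.File treats '/' as a separator on Windows as well as on Unix-like systems.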

Philippe Lhoste

Jun 20, 2011, 12:35:22 PM
to scala...@googlegroups.com

Yes, actually I do that in most of my programs, as double backslashes look terrible in
Java strings. But the triple quote has the advantage of taking a native Windows string (as
copy/pasted from the address bar, for example) and it is sad we hit such limitations.

And Windows paths aren't the only issue, as the OP rightly showed in his initial
message... Besides LaTeX, we might need to manipulate registry paths (e.g. to generate a
.reg file, for installation or other):
"""[HKEY_CLASSES_ROOT\Directory\shell\unlimitedApp]"""
or RTF:
"""Font styles: \b bold\b0 , \i italic\i0 , \ul underlined\ulnone ."""
and so on.
