String representation: why UTF-16?

Chris Angelico

Dec 21, 2012, 10:22:46 AM
to v8-u...@googlegroups.com
I'm fully aware that this may not be the best place for this, as it's
more a question for JavaScript itself than the V8 engine. But here
goes.

One of my projects at work involves a C++ program that can be
user-scripted - untrusted scripts that need to manipulate strings, and
a few basic aggregate types (mapping/dictionary and array/list,
implemented in JS using object and array). The C++ code works with
UTF-8 all the way, loading data from a PostgreSQL database, sending
stuff across TCP sockets, etc, etc, and I use the String constructor
and WriteUtf8 to get data into and out of JavaScript. So far, so good.

Everything works fine as long as all characters are in the BMP. But if
they're not, JavaScript's internal representation as UTF-16 starts to
be a problem. Suppose the script has this:

function first_two(s) {return s.substr(0,2);}
function remaining(s) {return s.substr(2);}

And you call each of those functions with a string constructed from
the following UTF-8 bytes:
"\xF0\x92\x8D\x85\x41\x41\x41"

That's three copies of the letter A, following a non-BMP character
(U+12345, which apparently is a cuneiform sign). The string has four
characters in it, so in theory, the first function should return the
astral character followed by a letter A, and the second function
should return "AA". But that's not what happens; the astral character
gets stored as the surrogate pair U+D808 U+DF45, which counts as two
code units, so the first function returns just one actual character,
and the second returns the three A's.
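
A minimal sketch of that first case, runnable in a plain JavaScript shell. The string is written with \u escapes for the surrogate pair instead of being built from UTF-8 bytes through the V8 API, and the variable name astral_first is just for illustration.

```javascript
// U+12345 written as its UTF-16 surrogate pair, followed by "AAA".
// (Illustrative literal; the real test built the string from UTF-8 bytes.)
var astral_first = "\uD808\uDF45AAA";

function first_two(s) {return s.substr(0,2);}
function remaining(s) {return s.substr(2);}

// length and substr count UTF-16 code units, not characters.
console.log(astral_first.length);             // 5
console.log(first_two(astral_first).length);  // 2 (both halves of one astral character)
console.log(remaining(astral_first));         // AAA
```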

It gets worse when a character gets split. Do the same test with this
input byte stream: "\x41\xF0\x92\x8D\x85\x41\x41" - exactly the same,
but with one letter A moved to the front. Now the first function
returns U+0041 U+D808, and the second returns U+DF45 U+0041 U+0041.
Each result now contains a lone surrogate. When those get rendered
into UTF-8 they come out as *invalid byte sequences*, which any
compliant parser (I was testing using the Pike utf8_to_string()
function) will throw out.
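
Here is a rough sketch of the split, with a check for lone surrogates. The helper has_lone_surrogate is a hypothetical name of mine, not part of JavaScript or V8; it only illustrates what a strict decoder on the receiving side would object to.

```javascript
// Hypothetical helper: true if s contains a surrogate code unit that is
// not part of a well-formed high+low pair.
function has_lone_surrogate(s) {
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      var next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return true;
      i++; // skip the low half of a valid pair
    } else if (c >= 0xDC00 && c <= 0xDFFF) {
      // Low surrogate with no high surrogate before it.
      return true;
    }
  }
  return false;
}

var astral_second = "A\uD808\uDF45AA";
var head = astral_second.substr(0, 2);  // U+0041 U+D808
var tail = astral_second.substr(2);     // U+DF45 U+0041 U+0041

console.log(has_lone_surrogate(astral_second)); // false
console.log(has_lone_surrogate(head));          // true
console.log(has_lone_surrogate(tail));          // true
```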

This is not an indictment of the V8 programmers. The JavaScript
specification is what's wrong. But I'm wondering if this might be a
place where an extension could be implemented, for the benefit of
embedded code that will never have to be executed using any other
interpreter, and then adoption might proceed from there.

Of course, the obvious way to fix the bug is to use UTF-32 / UCS-4 for
all strings, but that's fairly wasteful. An alternative that works
quite efficiently has been implemented by the Pike and Python
languages; conceptually, strings are stored in UTF-32, but in memory,
the leading 0 bytes are omitted if unnecessary. Each string has a
"width" of either 8, 16, or 32 (or if you prefer, 1, 2, or 4), based
on the highest codepoint in it. Python's string benchmark results
showed some operations slower under the new format, but others faster,
with the overall benchmark rating significantly improving (though the
exact improvement depends on myriad factors, of course); but more
importantly, string handling becomes *correct*.
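
To make the idea concrete, here is a minimal sketch of the width selection, written in JavaScript with typed arrays purely as an illustration. The names to_flexible and flexible_slice are hypothetical, and this is not how Pike, Python, or V8 actually lay strings out in memory.

```javascript
// Hypothetical helper: store a string at the narrowest fixed width
// (1, 2 or 4 bytes per code point) that fits its highest code point.
function to_flexible(s) {
  var points = [];
  var max = 0;
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    // Combine a well-formed surrogate pair into a single code point.
    if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length) {
      var d = s.charCodeAt(i + 1);
      if (d >= 0xDC00 && d <= 0xDFFF) {
        c = 0x10000 + ((c - 0xD800) << 10) + (d - 0xDC00);
        i++;
      }
    }
    points.push(c);
    if (c > max) max = c;
  }
  var width = max < 0x100 ? 1 : max < 0x10000 ? 2 : 4;
  var Store = width === 1 ? Uint8Array : width === 2 ? Uint16Array : Uint32Array;
  return { width: width, data: new Store(points) };
}

// With fixed-width storage, slicing by character is plain array indexing.
// (A real implementation would also re-narrow the result where possible.)
function flexible_slice(f, start, end) {
  return { width: f.width, data: f.data.subarray(start, end) };
}

var mixed = to_flexible("A\uD808\uDF45AA");
console.log(mixed.width, mixed.data.length);                    // 4 4
console.log(to_flexible("AAA").width);                          // 1
console.log(flexible_slice(mixed, 0, 2).data[1].toString(16));  // 12345
```

The point of the sketch is that every element is one whole code point, so an index can never land in the middle of a character.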

This would be a potentially incompatible change to code. It may be
worth requiring some sort of token at the top of the script, same as
"use strict" - something like "use strict unicode" - to engage this
behaviour. Scripts depending on this would then still function in
other engines, but with the potential to break on non-BMP characters.

Some handy info on the subject:
http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/
http://www.python.org/dev/peps/pep-0393/ - the Python Enhancement
Proposal discussing the new string type (has lots of specifics but
also the concept discussion)

I'd love to see V8 lead the JavaScript world in true Unicode handling.
Use of PEP-393 strings (I'm in two minds as to whether they should be
called that or "Pike strings") would be a great step forward for the
whole world.

Chris Angelico

Joshua Bell

Dec 21, 2012, 5:52:42 PM
to v8-u...@googlegroups.com
You should take a look at http://wiki.ecmascript.org/doku.php?id=harmony:unicode_supplementary_characters if you haven't, and look at the es-discuss archives https://mail.mozilla.org/listinfo/es-discuss for various discussions of improving Unicode handling in ES6.

The short version is that the next version of ECMAScript is gaining some capabilities to handle non-BMP code points more sensibly, but these will be rather limited and provide close to the bare minimum necessary for processing strings with "astral" data.
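
For a sense of what that looks like in practice, here is a rough sketch using facilities that did end up in ES2015 (String.fromCodePoint, codePointAt, and code-point iteration via Array.from). The functions first_two_cp and remaining_cp are hypothetical names, and they walk the whole string rather than indexing in constant time.

```javascript
// Hypothetical code-point-aware versions of first_two / remaining.
// Array.from walks a string by code point, so a surrogate pair stays whole.
function first_two_cp(s) {return Array.from(s).slice(0,2).join("");}
function remaining_cp(s) {return Array.from(s).slice(2).join("");}

var sample = String.fromCodePoint(0x41, 0x12345, 0x41, 0x41);
console.log(sample.length);                       // 5 (code units)
console.log(Array.from(sample).length);           // 4 (code points)
console.log(sample.codePointAt(1).toString(16));  // 12345
console.log(remaining_cp(sample));                // AA
```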

I realize that's somewhat orthogonal to your point which is about v8 internals, but ECMAScript itself is still firmly mired in the world of 16-bit code units. FWIW, Web APIs are also sticking with DOMStrings comprised of 16-bit code units.




Chris Angelico

Dec 21, 2012, 6:13:41 PM
to v8-u...@googlegroups.com
On Sat, Dec 22, 2012 at 9:52 AM, Joshua Bell <jsb...@chromium.org> wrote:
> You should take a look at
> http://wiki.ecmascript.org/doku.php?id=harmony:unicode_supplementary_characters
> if you haven't, and look at the es-discuss archives
> https://mail.mozilla.org/listinfo/es-discuss for various discussions of
> improving Unicode handling in ES6.

I'm glad there's discussion on the subject, at least! Of course, the
compatibility problems are very much there.
http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#UTF32

> The short version is that the next version of ECMAScript is gaining some
> capabilities to handle non-BMP code points more sensibly, but these will be
> rather limited and provide close to the bare minimum necessary for
> processing strings with "astral" data.
>
> I realize that's somewhat orthogonal to your point which is about v8
> internals, but ECMAScript itself is still firmly mired in the world of
> 16-bit code units. FWIW, Web APIs are also sticking with DOMStrings
> comprised of 16-bit code units.

The main problem is backward compatibility. I'll see if I can join the
ES discussion (as if I don't already have more mailing lists than I
can keep up with!), but this is also an implementation issue. The
flexible string representation depends on strings being immutable, as
they are in both Python and Pike, and ECMAScript fits that too. It'd
be very efficient at handling the common case where a UTF-8 string
contains no bytes >0x7F, as the original string buffer can be used to
represent the string itself (assuming that it's owned by the right
subsystem, etc).
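
The check for that common case is cheap. A minimal sketch, with is_ascii as a hypothetical helper name and a Uint8Array standing in for the incoming UTF-8 buffer:

```javascript
// Hypothetical helper: true if a UTF-8 buffer is pure ASCII (no byte > 0x7F).
// In that case every byte is already one code point at width 1, so the
// buffer could back the string directly without any transcoding.
function is_ascii(bytes) {
  for (var i = 0; i < bytes.length; i++) {
    if (bytes[i] > 0x7F) return false;
  }
  return true;
}

var plain_bytes = new Uint8Array([0x41, 0x41, 0x41]);
var astral_bytes = new Uint8Array([0xF0, 0x92, 0x8D, 0x85, 0x41, 0x41, 0x41]);

console.log(is_ascii(plain_bytes));   // true
console.log(is_ascii(astral_bytes));  // false
```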

I'd like to see this as an openly backward-incompatible change. It's
the easiest way forward - acknowledge that the previous behaviour is
buggy, and make it possible to run a script in non-buggy mode.

ChrisA