I'm fully aware that this may not be the best place for this, as it's
more a question about JavaScript itself than about the V8 engine
specifically. But here goes.
One of my projects at work involves a C++ program that can be
user-scripted - untrusted scripts that need to manipulate strings, and
a few basic aggregate types (mapping/dictionary and array/list,
implemented in JS using object and array). The C++ code works with
UTF-8 all the way, loading data from a PostgreSQL database, sending
stuff across TCP sockets, etc, etc, and I use the String constructor
and WriteUtf8 to get data into and out of JavaScript. So far, so good.
Everything works fine as long as all characters are in the BMP. But if
they're not, JavaScript's internal representation as UTF-16 starts to
be a problem. Suppose the script has this:
function first_two(s) {return s.substr(0,2);}
function remaining(s) {return s.substr(2);}
And you call each of those functions with a string constructed from
the following UTF-8 bytes:
"\xF0\x92\x8D\x85\x41\x41\x41"
That's three copies of the letter A, following a non-BMP character
(U+12345, which apparently is a cuneiform sign). The string has four
characters in it, so in theory, the first function should return the
astral character followed by a letter A, and the second function
should return "AA". But that's not what happens: the astral character
is stored as the surrogate pair U+D808 U+DF45, which substr counts as
two characters, so the first function returns just the one astral
character, and the second returns all three A's.
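To show the same thing without the C++ side, here's a sketch of what
the script ends up seeing, written directly in JavaScript (the \u
escapes are just my rendering of the decoded string):
var s = "\uD808\uDF45AAA"; // U+12345 followed by three A's
s.length;      // 4 characters, but JS reports 5
first_two(s);  // "\uD808\uDF45" - only the astral character
remaining(s);  // "AAA"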
It gets worse when a character gets split. Do the same test with this
input byte stream: "\x41\xF0\x92\x8D\x85\x41\x41" - exactly the same,
but with one letter A moved to the front. Now the first function
returns U+0041 U+D808, and the second returns U+DF45 U+0041 U+0041.
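The same split can be seen from pure JavaScript; a sketch of what the
engine ends up holding:
var t = "A\uD808\uDF45AA"; // "A", then U+12345, then "AA"
first_two(t);  // "A\uD808" - ends with a lone high surrogate
remaining(t);  // "\uDF45AA" - starts with a lone low surrogate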
Those lone surrogates then get encoded into UTF-8, producing byte
sequences that represent *invalid characters*, which any compliant
decoder (I was testing with Pike's utf8_to_string() function) will
reject.
This is not an indictment of the V8 programmers; it's the JavaScript
specification that's wrong. But I'm wondering if this might be a place
where an extension could be implemented, for the benefit of embedded
code that will never have to run on any other engine, with wider
adoption perhaps following from there.
Of course, the obvious way to fix the bug is to use UTF-32 / UCS-4 for
all strings, but that's fairly wasteful. An alternative that works
quite efficiently has been implemented by the Pike and Python
languages; conceptually, strings are stored in UTF-32, but in memory,
the leading 0 bytes are omitted if unnecessary. Each string has a
"width" of either 8, 16, or 32 (or if you prefer, 1, 2, or 4), based
on the highest codepoint in it. Python's string benchmarks showed some
operations slower under the new format and others faster, with the
overall rating improving significantly (though the exact improvement
depends on myriad factors, of course); but more importantly, string
handling becomes *correct*.
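To give a feel for the idea, here's a rough sketch in JavaScript of
how a string's width would be chosen (purely illustrative - the
function name and structure are mine, not V8's or CPython's
internals):
// Pick a storage width from the highest codepoint in the string.
function pickWidth(codepoints) {
    var max = Math.max.apply(null, codepoints);
    if (max < 0x100) return 1;    // Latin-1 range: one byte each
    if (max < 0x10000) return 2;  // BMP only: two bytes each
    return 4;                     // astral characters: four bytes each
}
pickWidth([0x41, 0x41, 0x41]);          // 1 - pure ASCII stays compact
pickWidth([0x12345, 0x41, 0x41, 0x41]); // 4 - one astral char widens it
Indexing and substr would then count whole codepoints regardless of
which width the string happens to be stored in.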
This would be a potentially incompatible change for existing code. It
may be worth requiring some sort of token at the top of the script,
just as with "use strict" - something like "use strict unicode" - to
engage this behaviour. Scripts depending on it would then still run in
other engines, but with the potential to break on non-BMP characters.
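Concretely, I'm imagining something like this (the directive itself is
hypothetical; the point is an opt-in marker that other engines would
simply ignore as an unused string literal):
"use strict unicode"; // hypothetical opt-in directive
function first_two(s) {return s.substr(0,2);}
// With the directive honoured, substr counts codepoints, so
// first_two("\uD808\uDF45AAA") returns U+12345 followed by one "A".
// An engine that doesn't know the directive still parses and runs the
// script, but falls back to UTF-16 code unit semantics.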
Some handy info on the subject:
http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/
http://www.python.org/dev/peps/pep-0393/ - the Python Enhancement
Proposal discussing the new string type (has lots of specifics but
also the concept discussion)
I'd love to see V8 lead the JavaScript world in true Unicode handling.
Use of PEP-393 strings (I'm in two minds as to whether they should be
called that or "Pike strings") would be a great step forward for the
whole world.
Chris Angelico