In most cases the spec tells us to treat strings as UCS-2, including
most string operations like charAt and case conversion. This is not
optional: handling surrogate pairs there would actually be incorrect
according to the spec. In a few cases (I can only think of 'eval',
but there may be more) the spec says to treat strings as UTF-16.
Again, this is not optional.
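For example (a quick sketch, run in a shell such as d8 where 'print'
is available, like the example further down), length and charAt see
the two halves of a surrogate pair as two separate characters:
var s = String.fromCharCode(0xD801) + String.fromCharCode(0xDC00);  // U+10400
print(s.length);                                // 2 -- code units, not characters
print(s.charAt(0).charCodeAt(0).toString(16));  // d801 -- a lone high surrogate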
As you say, for compatibility reasons we would be reluctant to switch
any of the places where we use UCS-2 over to UTF-16. However, for most
operations I think the switch could be made without breaking any code
on the web. For instance, JavaScriptCore uses UTF-16 for case
conversion and it doesn't seem to be an issue.
> So now my question is whether people expect to be able to use/store UTF-16
> in JavaScript even though this cannot be expected to work reliably for
> anything beyond the simplest read/write cases. I'm pondering whether I'd be
> doing my customers (client developers) a favor by using iconv to convert all
> text to UCS-2 before handing it to V8. This would give me an opportunity to
> detect that the input characters cannot be converted to UCS-2 before they
> ever got into V8 and caused subtle problems, possibly much farther down the
> road when it would be difficult to figure them out.
This is an application-specific question, so it's hard to give a
general answer. If your program depends on string operations being
correct according to the Unicode standard, for instance that surrogate
pairs are converted correctly to upper and lower case, then you're in
trouble if your program is written in JavaScript. However, most of
the language and even many string operations are unaffected by this,
and the operations that are affected still use a consistent and
reliable model -- it is just not the same as the Unicode model.
For example:
var dci = String.fromCharCode(0xD801) + String.fromCharCode(0xDC00);
var dli = dci.toLowerCase();
print(dci == dli);
(dci is a Deseret capital I, represented by a surrogate pair.) Under
UCS-2 this program prints true; under UTF-16 it prints false.
Programs like this cannot be detected reliably.
It's worth remembering that if you put UTF-16 into a JS string and
then get the UTF-16 out again, you will not lose any data. In that
sense V8 is transparent to UTF-16. It's only when you manipulate the
string in JS in certain ways that you risk 'corruption'. For example,
if you use substring to cut a string in the middle of a surrogate
pair, the result will no longer be valid UTF-16.
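A small sketch of both points (again assuming a d8-like shell with
'print'):
var pair = String.fromCharCode(0xD801) + String.fromCharCode(0xDC00);  // U+10400 as UTF-16
// Reading the code units back gives exactly what was put in:
print(pair.charCodeAt(0).toString(16));  // d801
print(pair.charCodeAt(1).toString(16));  // dc00
// But cutting the string between the two halves leaves a lone surrogate:
var head = pair.substring(0, 1);
print(head.charCodeAt(0).toString(16));  // d801 -- no longer valid UTF-16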
> the default behavior will be to assume the encoding, UCS-2, which is
> guaranteed to be free of surrogate pair subtleties.
I don't understand what this could mean in practice. If the input
contains only basic-plane (16-bit) characters then there is no
difference between UCS-2 and UTF-16, so in that case the flag would
make no difference. If the input contains characters from the 20-bit
space then UCS-2 can't represent them, so what will you do with them
if the user specifies UCS-2 but the input has such characters? I
think throwing them away would be worse than just leaving them in
there as surrogate pairs. I suppose you could throw an exception, but
that seems worse too.
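If all you want is to detect, before the text reaches V8, whether it
contains anything outside the basic plane, a check along these lines
would do (a sketch only -- 'containsSurrogates' is just a name I'm
using here, not something V8 provides):
function containsSurrogates(s) {
  // Any code unit in the surrogate range means the string uses (or
  // misuses) surrogate pairs, i.e. it is not plain basic-plane text.
  return /[\uD800-\uDFFF]/.test(s);
}
print(containsSurrogates("abc"));                                // false
print(containsSurrogates(String.fromCharCode(0xD801, 0xDC00)));  // true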