some minor string confusion :)

ondras

Oct 2, 2008, 5:00:41 PM
to v8-users
Hi again,

I have some trouble understanding all the string types in V8. What
exactly is the purpose of, and the difference between, v8::String::New,
v8::String::AsciiValue and v8::String::Utf8Value? How and when should
I use them?

Thanks for clarification,
Ondrej

Søren Gjesse

Oct 3, 2008, 2:59:47 AM
to v8-u...@googlegroups.com
There is only one string type in V8, which is v8::String. You can create a new String in a number of ways, with v8::String::New being the most commonly used. The classes v8::String::Utf8Value and v8::String::Value (and v8::String::AsciiValue, which is mainly for testing) are used to pull the string out as a char* or uint16_t* for use in C++, e.g.:

  v8::Handle<v8::String> str = v8::String::New("print");
  v8::String::Utf8Value s(str);
  printf("%s", *s);

Note that v8::String represents the string value (ECMA-262 4.3.16). To create a string object (ECMA-262 4.3.18) use NewInstance on the String function.
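
For the string-object case, a minimal sketch might look like the following (my illustration against the 2008-era API used above, assuming a Context is already entered):

  // Look up the String constructor on the global object and call it as a
  // constructor to get a string object (ECMA-262 4.3.18) rather than a
  // string value (4.3.16).
  v8::Handle<v8::Object> global = v8::Context::GetCurrent()->Global();
  v8::Handle<v8::Function> string_ctor = v8::Handle<v8::Function>::Cast(
      global->Get(v8::String::New("String")));
  v8::Handle<v8::Value> argv[] = { v8::String::New("print") };
  v8::Handle<v8::Object> str_obj = string_ctor->NewInstance(1, argv);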

Regards,
Søren

Pete Gontier

Oct 3, 2008, 9:29:59 PM
to v8-u...@googlegroups.com
ECMA-262 4.3.16 allows a fair amount of encoding flexibility.

Has V8 committed to any particular encoding?


Pete Gontier <http://pete.gontier.org/>


Søren Gjesse

Oct 4, 2008, 9:27:04 AM
to v8-u...@googlegroups.com
Inside V8 there are a number of different string representations. The basic ones are an ASCII representation (AsciiString) and a two-byte representation (TwoByteString); the first is used when all characters are ASCII and therefore only one byte is required to store each character. Besides that, V8 has concatenated strings (ConsString) and string slices (SlicedString). A concatenated string points to the two strings that were concatenated without materializing the result, whereas a string slice points to a part of an existing string. V8 tries to make the best choice when creating new strings, and there are a number of rules for materializing (flattening) concatenated strings when certain operations are performed.

Finally, there are also external strings in ASCII and two-byte variants (ExternalAsciiString and ExternalTwoByteString). These are strings which are not present in the V8 heap but are references to strings in C++ land, added through the API. In Chrome, external strings are used to add the JavaScript source code from web pages to V8 without making an additional copy.
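
To make the external string case concrete, here is a minimal sketch (my own illustration against the 2008-era API, not code from V8 or Chrome) of exposing a C++-owned two-byte buffer to V8 without copying it:

  // The resource only describes a buffer that lives outside the V8 heap.
  // V8 keeps a pointer to it, so the buffer must stay valid for as long as
  // the string is alive.
  class StaticTwoByteResource : public v8::String::ExternalStringResource {
   public:
    StaticTwoByteResource(const uint16_t* data, size_t length)
        : data_(data), length_(length) {}
    virtual const uint16_t* data() const { return data_; }
    virtual size_t length() const { return length_; }
   private:
    const uint16_t* data_;
    size_t length_;
  };

  static const uint16_t kSource[] = { 'v', '8' };
  v8::Handle<v8::String> external =
      v8::String::NewExternal(new StaticTwoByteResource(kSource, 2));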

Regards,
Søren

Pete Gontier

Oct 4, 2008, 1:53:09 PM
to v8-u...@googlegroups.com
It sounds as if I didn't ask my question very well. Let me try again. I'm going to explain some things as if you didn't know them, even though you obviously do, just to make clear what I'm asking about.

Every string has an encoding: UCS-2, ASCII, UTF-8, Shift JIS, UTF-16, etc. Unicode strings are also either composed or decomposed in one of several ways.

ECMA-262 4.3.16 doesn't specify an encoding for JavaScript strings. It specifies that strings are arrays of 16-bit integers. It doesn't specify semantics for those integers. It says each of these integers is "usually" UTF-16 (without suggesting a de/composition) but doesn't specify it.

Obviously, V8 is free to do whatever it likes with strings internally in order to get its job done. However, a couple of questions remain from an interface standpoint:

  • What encoding and de/composition can JavaScript programs expect? (I expect this will be dictated by the expectations of programs such as Gmail.)

  • What encoding and de/composition can clients of v8::String::Write, v8::String::ExternalStringResource, and v8::String::Value expect? (I expect this will be dictated by the expectations of programs such as Chrome.)

I am not a Unicode expert, so I recognize these questions may seem silly on some level.


Pete Gontier <http://pete.gontier.org/>


Pete Gontier

Oct 5, 2008, 11:07:24 PM
to v8-users
I was spelunking the header just now and ran across some comments which made specific reference to UTF-16, so that's good. It would still be useful to know which de/composition to expect. It might seem needlessly specific, but because others have done it, it's useful to know.

Pete Gontier <http://pete.gontier.org/>


Pete Gontier

Oct 20, 2008, 7:13:45 PM
to v8-users
My understanding of this stuff recently deepened a little bit after I read up on the behavior to be expected from strings and regular expressions. Consequently, a potential course of action has presented itself to me. I am not the world's foremost expert on these topics, so feel free to correct me on any aspect of the below.

An optimistic reading of ECMA-262 4.3.16 suggests that JavaScript can be expected to support UTF-16, even though it's not strictly required, and comments in the V8 headers suggest that strings are indeed UTF-16. However, in the real world, it turns out that JavaScript string functions and regular expressions are not required to support UTF-16, which can have surrogate pairs (two 16-bit code units representing a single character). Because of this pre-existing condition, V8 is not in a position to do better, since doing so would break compatibility with other JavaScript engines.

The net effect here is that V8 (and all other JavaScript engines) actually supports UCS-2, which is a proper subset of UTF-16. (Both encodings support the Basic Multilingual Plane of Unicode.) This is mildly bad news, but I am happy to understand it, and things could be a lot worse. (I imagine ECMA knows about this problem and is thinking about a solution.)

So now my question is whether people expect to be able to use/store UTF-16 in JavaScript even though this cannot be expected to work reliably for anything beyond the simplest read/write cases. I'm pondering whether I'd be doing my customers (client developers) a favor by using iconv to convert all text to UCS-2 before handing it to V8. This would give me an opportunity to detect that the input characters cannot be converted to UCS-2 before they ever got into V8 and caused subtle problems, possibly much farther down the road when it would be difficult to figure them out.
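
For what it's worth, once the text is in 16-bit code units the check itself is simple: strict UCS-2 just means "no surrogate code units". A hypothetical helper (just a sketch, not an existing API) might look like this:

  // Returns true if the buffer contains no UTF-16 surrogate code units
  // (0xD800-0xDFFF), i.e. it is plain UCS-2 and free of surrogate-pair
  // subtleties once it is inside JavaScript.
  static bool IsStrictUcs2(const uint16_t* units, int length) {
    for (int i = 0; i < length; i++) {
      if (units[i] >= 0xD800 && units[i] <= 0xDFFF) return false;
    }
    return true;
  }

Anything that passes this check can be handed to v8::String::New(units, length) with no risk of a later substring splitting a character in half.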


Pete Gontier <http://pete.gontier.org/>


Christian Plesner Hansen

Oct 21, 2008, 4:45:08 AM
to v8-u...@googlegroups.com
> An optimistic reading of ECMA-262 4.3.16 suggests that JavaScript can be
> expected to support UTF-16, even though it's not strictly required, and
> comments in the V8 headers suggest that strings are indeed UTF-16. However,
> in the real world, it turns out that JavaScript string functions and regular
> expressions are not required to support UTF-16, which can have surrogate
> pairs (multiple 16-bit quantities representing a single character). Because
> of this pre-existing condition, V8 is not in a position to do better, since
> this would break compatibility with other JavaScript engines.

In most cases the spec tells us to treat strings as UCS-2, including
most string operations like charAt and case conversion. This is not
optional; handling surrogate pairs would actually be incorrect
according to the spec. In a few cases (I can only think of 'eval', but
there may be more) the spec says to treat strings as UTF-16. Again,
this is not optional.

As you say, for compatibility reasons we would be reluctant to switch
any of the places we use UCS-2 to using UTF-16. However, for most
operations I think the switch could be made without breaking any code
on the web. For instance, JavaScriptCore uses UTF-16 for case
conversion and it doesn't seem to be an issue.

> So now my question is whether people expect to be able to use/store UTF-16
> in JavaScript even though this cannot be expected to work reliably for
> anything beyond the simplest read/write cases. I'm pondering whether I'd be
> doing my customers (client developers) a favor by using iconv to convert all
> text to UCS-2 before handing it to V8. This would give me an opportunity to
> detect that the input characters cannot be converted to UCS-2 before they
> ever got into V8 and caused subtle problems, possibly much farther down the
> road when it would be difficult to figure them out.

This is an application-specific question; it's very hard to give a
general answer. If your program depends on string operations being
correct according to the Unicode standard, for instance that surrogate
pairs are converted correctly to upper and lower case, then you're in
trouble if your program is written in JavaScript. However, most of
the language and even many string operations are unaffected by this,
and the operations that are affected still use a consistent and
reliable model -- it is just not the same as the Unicode model.

Pete Gontier

Oct 21, 2008, 9:08:56 PM
to v8-u...@googlegroups.com
Thanks for the insight and thanks in advance for tolerating my thinking out loud here.

The app in question is an application server in early development. When I say "customers (client developers)", I'm referring to the future. Happily, I'm not concerned about a large body of existing code. As well, I don't think I need to be concerned about militant JavaScript activists demanding UTF-16 in the few cases it's allowed.

So, on one hand, I may have an opportunity now to prevent some heartache and head-scratching, and I'm somewhat inclined to be a proactive, paranoid gatekeeper and require every string coming in from the outside world to convert with full fidelity to UCS-2, even if there are some cases (such as 'eval') which would tolerate UTF-16.

On the other hand, I'm not so crazy as to think I want to implement every bit of this application server myself, and there may well be script libraries written primarily for use within web browsers which I would like to incorporate -- or anyway make it possible to incorporate. I suppose if the only strings they ever see are UCS-2, then they will work just fine, but if they have features which depend on UTF-16, those will break or cause breakage. I bet such features are few and far between, but I can't know conclusively. Hmmm.

I suppose one approach would be to use UCS-2 until someone complains. :-)


Pete Gontier <http://pete.gontier.org/>

Erik Corry

Oct 22, 2008, 3:45:15 AM
to v8-u...@googlegroups.com
It's worth remembering that if you put UTF-16 into a JS string and then get the UTF-16 out again, you will not lose any data.  In a sense V8 is transparent to UTF-16.  It's only when you manipulate the string in JS in certain ways that you risk 'corruption'.  For example, if you use substring to cut a string in the middle of a surrogate pair, the result will no longer be valid UTF-16.
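
As a sketch of that transparency (2008-era API, context already entered; an illustration rather than code from V8):

  uint16_t in[] = { 0xD801, 0xDC00 };  // U+10400 as a UTF-16 surrogate pair
  v8::Handle<v8::String> s = v8::String::New(in, 2);

  uint16_t out[2];
  s->Write(out, 0, 2);
  // out[0] == 0xD801 && out[1] == 0xDC00: the pair comes back untouched.
  // Only operations performed by the script itself, such as substring,
  // can split it.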
--
Erik Corry, Software Engineer
Google Denmark ApS.  CVR nr. 28 86 69 84
c/o Philip & Partners, 7 Vognmagergade, P.O. Box 2227, DK-1018 Copenhagen K, Denmark.

Christian Plesner Hansen

Oct 22, 2008, 4:20:22 AM
to v8-u...@googlegroups.com
Note also that you can't generally tell whether a program will behave
correctly under UCS-2. For instance, consider this program:

var dci = String.fromCharCode(0xD801) + String.fromCharCode(0xDC00);
var dli = dci.toLowerCase();
print(dci == dli);

(dci is a Deseret capital I, represented by a surrogate pair). Under
UCS-2 this program prints true; under UTF-16 it prints false.
Programs like this cannot be detected reliably.

Pete Gontier

Oct 26, 2008, 2:19:08 PM
to v8-u...@googlegroups.com
On Oct 22, 2008, at 12:45 AM, Erik Corry wrote:

> It's worth remembering that if you put UTF-16 into a JS string and then get the UTF-16 out again, you will not lose any data.  In a sense V8 is transparent to UTF-16.  It's only when you manipulate the string in JS in certain ways that you risk 'corruption'.  For example, if you use substring to cut a string in the middle of a surrogate pair, the result will no longer be valid UTF-16.

That's exactly the situation I'm pondering. On one hand, chopping a surrogate pair in half will create problems which will probably be difficult for most scripters to detect/comprehend/diagnose/handle, especially if the string gets passed around for a while and maybe combined with others before the problem shows symptoms. On the other hand, denying savvy scripters the ability to store UTF-16 at all will probably frustrate some.

So far, when I import strings, there's an import object with a property representing the source encoding, and I've been assuming the destination encoding because I thought it should always be UTF-16. (Here's the example.) Perhaps I could have an additional property which specifies the destination encoding, and if it's absent, assume UCS-2, and if it's present, it can be either UCS-2 or UTF-16. That way, savvy scripters who really want to put UTF-16 into a JavaScript string have a way to do it, but the default behavior will be to assume the encoding, UCS-2, which is guaranteed to be free of surrogate pair subtleties.
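
To make that concrete, something like the following could back the import call (an entirely hypothetical sketch against the 2008-era V8 API; the property name, the handler, and the buffer are inventions for illustration):

  // Hypothetical native callback behind the import object described above;
  // args[0] is assumed to carry an optional "destinationEncoding" property.
  // (Needs <string.h> for strcmp.)
  v8::Handle<v8::Value> ImportText(const v8::Arguments& args) {
    v8::Handle<v8::Object> request = args[0]->ToObject();
    v8::Handle<v8::Value> dest =
        request->Get(v8::String::New("destinationEncoding"));

    // Default to strict UCS-2; the caller must opt in to UTF-16 explicitly.
    bool allow_utf16 = false;
    if (dest->IsString()) {
      v8::String::Utf8Value name(dest);
      allow_utf16 = (strcmp(*name, "UTF-16") == 0);
    }

    // Stand-in for the decoded input; in the real server this would come
    // from iconv after converting from the declared source encoding.
    static const uint16_t text[] = { 0xD801, 0xDC00 };  // outside UCS-2
    const int text_length = 2;

    bool has_surrogates = false;
    for (int i = 0; i < text_length; i++) {
      if (text[i] >= 0xD800 && text[i] <= 0xDFFF) has_surrogates = true;
    }
    if (has_surrogates && !allow_utf16) {
      return v8::ThrowException(
          v8::String::New("input text does not fit in UCS-2"));
    }
    return v8::String::New(text, text_length);
  }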

 
– Pete Gontier <http://pete.gontier.org/>   



Pete Gontier

Oct 26, 2008, 2:19:12 PM
to v8-u...@googlegroups.com
Too true. That's why I mentioned, though, that I do not have an existing body of code to support. You guys must work within Chrome, and Chrome must work with squillions of existing web pages. So I understand why this would be a big consideration for you, but I suspect/hope that I have an opportunity here to document JavaScript's odd hybrid encoding approach to Unicode and steer people toward UCS-2 unless they really need UTF-16; if they do, they may need to do extra work, or at least be very careful to avoid logic which could cost them a lot of debugging time.
 
– Pete Gontier <http://pete.gontier.org/>   



Erik Corry

Oct 26, 2008, 3:03:42 PM
to v8-u...@googlegroups.com
On Sun, Oct 26, 2008 at 7:19 PM, Pete Gontier <pe...@gontier.org> wrote:
> the default behavior will be to assume the encoding, UCS-2, which is guaranteed to be free of surrogate pair subtleties.

I don't understand what this could mean in practice.  If the input contains only basic plane (16-bit) characters then there is no difference between UCS-2 and UTF-16, so in this case the flag would make no difference.  If the input contains characters from the 20-bit space then UCS-2 can't represent them, so what will you do with them if the user specifies UCS-2 but the input has such characters?  I think throwing them away would be worse than just leaving them in there as surrogate pairs.  I suppose you could throw an exception, but that seems worse too.
 

 





Pete Gontier

Oct 26, 2008, 4:29:14 PM
to v8-u...@googlegroups.com
On Oct 26, 2008, at 12:03 PM, Erik Corry wrote:

>> the default behavior will be to assume the encoding, UCS-2, which is guaranteed to be free of surrogate pair subtleties.

> I don't understand what this could mean in practice.  If the input contains only basic plane (16-bit) characters then there is no difference between UCS-2 and UTF-16, so in this case the flag would make no difference.  If the input contains characters from the 20-bit space then UCS-2 can't represent them, so what will you do with them if the user specifies UCS-2 but the input has such characters?  I think throwing them away would be worse than just leaving them in there as surrogate pairs.  I suppose you could throw an exception, but that seems worse too.

I was planning to throw an exception.

Seems to me my choice here is between [1] doing nothing and allowing people to encounter subtle bugs in their own code and [2] being an annoying pedantic gatekeeper who forces people to explicitly request a potentially problematic situation. Neither option is perfect; the question is which is less bad.

The situation that concerns me most is that a team may write a lot of code which naively assumes JavaScript strings are UCS-2, because the team's native language fits into UCS-2, and maybe the language of their neighbors fits into UCS-2 as well. By the time they realize their code has subtle problems processing UTF-16 text, their investment in the project is already too substantial to fix the problems, so they are forced, late in the development cycle, to abandon entire markets.

The exception would be a big unmistakable warning the very first time they attempt to use input text which doesn't fit into UCS-2 -- perhaps without realizing it -- before the problem has a chance to become tricky to diagnose. Yes, they can explicitly accept UTF-16 to inhibit the exception, but they had better know the rest of their code can actually process it, and they had better understand that they can't expect the built-in string and regexp facilities to help with that.

In short, my hope would be that the exception makes it easier to discover earlier that UTF-16 is a huge issue.

 
– Pete Gontier <http://pete.gontier.org/>   


