http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings&s=unicode#support_full_unicode_in_the_ecmascript_runtime_strings
It appears that TC39 is considering changing the underlying format of
JavaScript strings. Considering that the whole Web platform currently
uses (potentially malformed) UTF-16, changes in this area seem very
radical and need a broader review than TC39-internal work. (Hence, I'm
posting here.)
Personally, I think changing the underlying format of in-memory
strings on the Web is a bad idea. In retrospect, it almost always
turns out that using UTF-8 would have been the right call (go Rust!)
even where UTF-16 is used for historical reasons (mistakes of the
1990s). Yet, the proposal is to switch to UTF-32--the opposite
direction from UTF-8--and at this point, it's probably too late to
change to UTF-8 for in-memory strings.
Changing that DOM-internal representation to UTF-32 would be even more
wasteful of memory than UTF-16. It would also be a lot of work, so I
don't expect that to happen. Having an internal storage mismatch
between the JavaScript engine and the rest of Gecko seems like a
performance problem.
Moreover, JavaScript strings can already represent all Unicode by
representing astral characters as surrogate pairs, so the switch would
not add expressiveness to JS strings. The main perceived benefit from
switching to UTF-32, AFAICT, would be the ability to index by Unicode
character as opposed to by UTF-16 code unit, which appears right as a
matter of principle. Yet, being able to index by
Unicode character isn't that useful. If you want to traverse the
whole string on a by-character basis, the language could provide
iterators instead of the programmer writing a loop with an index
variable. Furthermore, it's not even that useful to work with a
Unicode string on a per-character basis. Since Unicode has combining
characters, characters don't correspond to the units the user sees as
atomic shapes (grapheme clusters). Thus, if you break strings naively
on character boundaries, you can still get bad results by ending up
splitting between combining characters.
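To illustrate the iterator point, here's a rough sketch (mine, not part
of the strawman; the function name and callback API are made up) of
by-code-point traversal implemented on top of today's UTF-16-indexed
strings:

```javascript
// Hypothetical sketch: traverse a JS string by Unicode code point
// without changing the underlying UTF-16 representation.
function forEachCodePoint(str, callback) {
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    // If this is a high surrogate followed by a low surrogate,
    // combine the pair into a single astral code point.
    if (code >= 0xD800 && code <= 0xDBFF && i + 1 < str.length) {
      var low = str.charCodeAt(i + 1);
      if (low >= 0xDC00 && low <= 0xDFFF) {
        code = ((code - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
        i++; // skip the low surrogate
      }
    }
    callback(code);
  }
}
```

An engine-provided iterator along these lines would give by-character
traversal without making indexing by character part of the string type.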
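And a quick demonstration (my example, not from the proposal) of why
per-character operations still split grapheme clusters, even with no
surrogate pairs anywhere in sight:

```javascript
// "café" written as "cafe" plus U+0301 COMBINING ACUTE ACCENT:
// five code units, five Unicode characters, four grapheme clusters.
var cafe = "cafe\u0301";
// A naive "first four characters" split drops the accent. UTF-32
// indexing would give exactly the same wrong answer here.
var truncated = cafe.slice(0, 4); // "cafe" without the accent
```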
It seems to me that switching from UTF-16 to UTF-32 would cause a lot
of problems without any real upside. What problem is the proposed
switch meant to solve? As far as I can tell, Python's
principle-driven switch from UTF-16 to UTF-32 has caused nothing but
trouble and they don't have Web compatibility to deal with.
--
Henri Sivonen
hsiv...@iki.fi
http://hsivonen.iki.fi/