http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings&s=unicode#support_full_unicode_in_the_ecmascript_runtime_strings
It appears that TC39 is considering changing the underlying format of
JavaScript strings. Considering that the whole Web platform currently
uses (potentially malformed) UTF-16, changes in this area seem very
radical and need a broader review than TC39-internal work. (Hence, I'm
posting here.)
Personally, I think changing the underlying format of in-memory
strings on the Web is a bad idea. In retrospect, it almost always
turns out that using UTF-8 would have been the right call (go Rust!)
even where UTF-16 is used for historical reasons (mistakes of the
1990s). Yet, the proposal is to switch to UTF-32--the opposite
direction from UTF-8--and at this point, it's probably too late to
change to UTF-8 for in-memory strings.
Changing that DOM-internal representation to UTF-32 would be even more
wasteful of memory than UTF-16. It would also be a lot of work, so I
don't expect that to happen. Having an internal storage mismatch
between the JavaScript engine and the rest of Gecko seems like a
performance problem.
Moreover, JavaScript strings can already represent all Unicode by
representing astral characters as surrogate pairs, so the switch would
not add expressiveness to JS strings. The main perceived benefit from
switching to UTF-32, AFAICT, would be the ability to index by Unicode
character as opposed to by UTF-16 code unit, which appears right as a
matter of principle. Yet, being able to index by
Unicode character isn't that useful. If you want to traverse the
whole string on a by-character basis, the language could provide
iterators instead of the programmer writing a loop with an index
variable. Furthermore, it's not even that useful to work with a
Unicode string on a per-character basis. Since Unicode has combining
characters, characters don't correspond to the units the user sees as
atomic shapes (grapheme clusters). Thus, if you break strings naively
on character boundaries, you can still get bad results by ending up
splitting between combining characters.
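To illustrate the iterator point, here's a rough sketch (mine, not part
of the strawman; the function name and callback API are made up) of
by-code-point traversal implemented on top of today's UTF-16-indexed
strings:

```javascript
// Hypothetical sketch: traverse a JS string by Unicode code point
// without changing the underlying UTF-16 representation.
function forEachCodePoint(str, callback) {
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    // If this is a high surrogate followed by a low surrogate,
    // combine the pair into a single astral code point.
    if (code >= 0xD800 && code <= 0xDBFF && i + 1 < str.length) {
      var low = str.charCodeAt(i + 1);
      if (low >= 0xDC00 && low <= 0xDFFF) {
        code = ((code - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
        i++; // skip the low surrogate
      }
    }
    callback(code);
  }
}
```

An engine-provided iterator along these lines would give by-character
traversal without making indexing by character part of the string type.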
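And a quick demonstration (my example, not from the proposal) of why
per-character operations still split grapheme clusters, even with no
surrogate pairs anywhere in sight:

```javascript
// "café" written as "cafe" plus U+0301 COMBINING ACUTE ACCENT:
// five code units, five Unicode characters, four grapheme clusters.
var cafe = "cafe\u0301";
// A naive "first four characters" split drops the accent. UTF-32
// indexing would give exactly the same wrong answer here.
var truncated = cafe.slice(0, 4); // "cafe" without the accent
```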
It seems to me that switching from UTF-16 to UTF-32 would cause a lot
of problems without any real upside. What problem is the proposed
switch meant to solve? As far as I can tell, Python's
principle-driven switch from UTF-16 to UTF-32 has caused nothing but
trouble and they don't have Web compatibility to deal with.
--
Henri Sivonen
hsiv...@iki.fi
http://hsivonen.iki.fi/