This.
The problem is that JavaScript strings are UCS-2 (sequences of 16-bit
code units), while Ruby strings are just sequences of 8-bit bytes.
So we have to decide how we convert between them:
Option 1: we treat JS strings as being full of bytes too, and ignore
the upper 8 bits of each code unit. This has the advantage that every
Ruby string can be converted to JS; the downside is that any JS
character above 255 can't survive the round trip.
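
To make that concrete, here's a minimal sketch of Option 1, assuming
the bridge hands us JS strings as arrays of 16-bit code units (the
function names are just for illustration):

    # Option 1: JS -> Ruby keeps only the low byte of each code unit;
    # Ruby -> JS promotes each byte to a code unit unchanged.
    def js_units_to_ruby(units)       # units: integers in 0..0xFFFF
      units.map { |u| u & 0xFF }.pack("C*")
    end

    def ruby_to_js_units(str)
      str.unpack("C*")                # one JS code unit per byte
    end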
Option 2: we treat Ruby strings as UTF-8, and convert between the two
representations as necessary. This has the advantage that all
JavaScript strings can be converted to Ruby... and it assumes that
anyone using multi-byte characters in JS is probably doing the same in
Ruby, and knows what they're doing. But it means we can only handle
Ruby strings that are valid UTF-8.
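
A similarly rough sketch of Option 2, leaning on Ruby's UTF-8
pack/unpack support (again, the names are made up, and anything
outside the BMP would really need surrogate-pair handling on the JS
side):

    # Option 2: Ruby strings are assumed to be valid UTF-8.
    def ruby_to_js_codepoints(str)
      str.unpack("U*")          # raises ArgumentError on malformed UTF-8
    end

    def js_codepoints_to_ruby(codepoints)
      codepoints.pack("U*")     # back to a UTF-8 byte string
    end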
The third option is of course to translate byte-for-byte between the
two formats, so one JS character corresponds to two Ruby characters...
but while that would allow every character in both environments
through unchanged, the Ruby-land strings would be near-useless as
text.
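
For completeness, that lossless-but-ugly version might look something
like this (big-endian 16-bit code units chosen arbitrarily; odd-length
Ruby strings would need extra handling):

    # Option 3: each 16-bit JS code unit becomes exactly two Ruby bytes.
    def js_units_to_ruby(units)
      units.pack("n*")          # "n" = 16-bit big-endian
    end

    def ruby_to_js_units(str)
      str.unpack("n*")          # odd-length strings need extra care
    end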
Initially, we were going with Option 1, mostly due to lack of any
specific consideration. Someone tried to use multi-byte strings, and
complained that JavaScript wasn't seeing useful values; absent a
compelling reason to keep the previous behaviour, I changed it to
implement Option 2.
I guess another possibility would be to choose between the Option 1 and
Option 2 behaviours based on $KCODE or something.
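
Something like the following, presumably (just a sketch; I believe
Ruby 1.8 itself only looks at the first letter of $KCODE, hence the
regexp):

    # Hypothetical: pick Option 2 when $KCODE says UTF-8, else Option 1.
    def ruby_to_js_units(str)
      if $KCODE =~ /^u/i        # "UTF8", "u", etc.
        str.unpack("U*")        # Option 2: decode as UTF-8 code points
      else
        str.unpack("C*")        # Option 1: one code unit per raw byte
      end
    end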
The final option is the most complicated... try as hard as we possibly
can to avoid ever converting from one type of string to the other, and
instead provide duck-type-compatible String-like proxies.
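
That could look roughly like this, where JSString#to_utf8 stands in
for whatever the bridge's real conversion call would be:

    # Hypothetical proxy: forward String-ish calls to a lazily
    # converted copy, so conversion only happens when it's needed.
    class JSStringProxy
      def initialize(js_string)
        @js_string = js_string
      end

      def to_s
        @ruby ||= @js_string.to_utf8    # convert at most once, on demand
      end
      alias_method :to_str, :to_s

      def method_missing(name, *args, &block)
        to_s.send(name, *args, &block)  # otherwise duck-type as a String
      end
    end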
Matthew