Difference between current SpiderMonkey and 1.8.5 string behavior?

Kent Williams

Mar 18, 2015, 11:27:14 AM

to dev-tech-...@lists.mozilla.org

This has to do with comparing the JavaScript 1.8.5 interpreter with the
current version (built off of
https://github.com/mozilla/gecko-dev/tree/GECKO3601_2015030504_RELBRANCH).
The file 'utf8file.txt' is here: http://cornwarning.com/xfer/utf8file.txt
I've attached my test script -- if this listserv strips attachments,
it's here: http://www.cornwarning.com/xfer/test.js

The problem I'm running into is a mismatch between the 'native' string
type and UTF8 files. This has to do with custom file reader JSNative
functions that I won't go into here, but the test script illustrates the
problem:

In the old (1.8.5) version, the strings are reported with a length of
21, and in the new version, they have a length of 17. I assume that
this has something to do with the changes mentioned here:
https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/

The problem I have is reconciling the new behavior with our own code
that expects the old behavior. We have a custom function that reads
lines from a file and creates JS::String objects from each line. Using
the 1.8.5 JS library, the resulting strings are equal when tested in
JavaScript. With the current version they are not equal.

There's another minor issue: In 1.8.5 read(<filename>, "binary") returns
a string, and the current version returns a Uint8Array. I don't think
most people would run into that script-breaking change, but it rather
baffled me.

Here is the output of my script, and the rendering of UTF8 text might
not survive e-mailing:
js185 test.js
185
"Láttam hörcsögöt." length=21
"Láttam hörcsögöt." length=21
equal=true
Y values
4c c3 a1 74 74 61 6d 20
68 c3 b6 72 63 73 c3 b6
67 c3 b6 74 2e
Y Binary length=21
Y binary values
Type of ybin string
Láttam hörcsögöt.
4c c3 a1 74 74 61 6d 20
68 c3 b6 72 63 73 c3 b6
67 c3 b6 74 2e
[kwilliams@onderon ~]$ js36 test.js
185
"Láttam hörcsögöt." length=17
"Láttam hörcsögöt." length=17
equal=true
Y values
4c e1 74 74 61 6d 20 68
f6 72 63 73 f6 67 f6 74
2e
Y Binary length=21
Y binary values
Type of ybin object
[object Uint8Array]
4c c3 a1 74 74 61 6d 20
68 c3 b6 72 63 73 c3 b6
67 c3 b6 74 2e

--

Kent Williams | C++ Developer
CourseLeaf from Leepfrog Technologies
“The Process of Academic Change”
319-337-3877 | courseleaf.com <http://www.courseleaf.com/>

Jason Orendorff

Mar 18, 2015, 12:21:34 PM

to Kent Williams, dev-tech-...@lists.mozilla.org

On Wed, Mar 18, 2015 at 10:25 AM, Kent Williams <kwil...@leepfrog.com>
wrote:

> This has to do with comparing the Javascript 1.8.5 interpreter with the
> current version (built off of
> https://github.com/mozilla/gecko-dev/tree/GECKO3601_2015030504_RELBRANCH)
> The file 'utf8file.txt' is here http://cornwarning.com/xfer/utf8file.txt
> I've attached my test script -- if this listserve strips attachments, it's
> here: http://www.cornwarning.com/xfer/test.js
>

Thanks for attaching complete code and data. It would have been impossible
to look into this without it.

> The problem I'm running into is a mismatch between the 'native' string
> type and UTF8 files. This has to do with custom file reader JSNative
> functions that I won't go into here, but the test script illustrates the
> problem:
>
> In the old (1.8.5) version, the strings are listed to have a length of 21,
> and in the new version, they've got a length of 17.

I think this just means that the old version was wrong and the new version
was right. Bytes and characters are two different things.

"Láttam hörcsögöt." is in fact 17 characters (count them). But it takes 21
bytes to write it out in UTF-8. Each character with a diacritic costs an
extra byte.
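
To make the byte/character distinction concrete, here is a minimal standalone C++ sketch (not SpiderMonkey code). The helper name `countCodePoints` and the constant `kSample` are made up for this illustration, and the function assumes its input is already well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 byte string by counting every byte
// that is NOT a continuation byte (continuation bytes look like 10xxxxxx).
// Assumption: the input is well-formed UTF-8; no validation is performed.
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// The sample line "Láttam hörcsögöt." written with hex escapes so the bytes
// do not depend on the encoding of the source file.
const std::string kSample = "L\xc3\xa1ttam h\xc3\xb6rcs\xc3\xb6g\xc3\xb6t.";
```

For the sample line, `kSample.size()` is the byte count (21) and `countCodePoints(kSample)` is the character count (17), which matches the two lengths reported by the test script.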

The old read() function in 1.8.5 turned each byte of input into a single
string character. That may be working for you, but strings built this way
are not going to work well with standard library features, all of which
assume that your strings hold UTF-16 text rather than raw bytes. For
example, you should be able to do this:

js> print("Láttam hörcsögöt.".toUpperCase())
LÁTTAM HÖRCSÖGÖT.

In 1.8.5 I would expect the characters with diacritics not to be uppercased.
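
As a rough illustration of the byte-per-character behavior described above, the following standalone C++ sketch widens each input byte to one 16-bit unit without decoding. `inflateBytes` is a hypothetical stand-in written for this note; it is not taken from the 1.8.5 sources.

```cpp
#include <string>

// Hypothetical stand-in for the old behavior: every input byte becomes one
// 16-bit string element, with no UTF-8 decoding at all.
std::u16string inflateBytes(const std::string& bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        out.push_back(static_cast<char16_t>(b));
    }
    return out;
}
```

Fed the three bytes `4c c3 a1` ("Lá" in UTF-8), this yields three elements, 0x4c, 0xc3 and 0xa1, which is the pattern visible in the js185 hex dump. A real decoder would produce two elements, 0x4c and 0xe1.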

> I assume that this has something to do with the changes mentioned here:
> https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/
>

It's definitely not related.

> The problem I have is reconciling the new behavior with our own code that
> expects the old behavior. We have a custom function that reads lines from a
> file and creates JS::String objects from each line. Using the 1.8.5 JS
> library, the resulting strings are equal when tested in Javascript. With
> the current version they are not equal.
>

I think you can change your custom function to convert the input from UTF-8
to UTF-16 using code like this:

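    // Decode the UTF-8 bytes in buf. On success, len is updated to the
    // number of UTF-16 code units in ucbuf.
    char16_t *ucbuf =
        JS::UTF8CharsToNewTwoByteCharsZ(cx,
            JS::UTF8Chars(buf, len), &len).get();
    if (!ucbuf) {
        JS_ReportError(cx, "Invalid UTF-8 in file '%s'",
                       pathname);
        gExitCode = EXITCODE_RUNTIME_ERROR;
        return nullptr;
    }
    // Copy the UTF-16 data into a new JS string, then release the buffer.
    str = JS_NewUCStringCopyN(cx, ucbuf, len);
    free(ucbuf);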

That's all read() is doing.
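
For readers who want to see what such a UTF-8 to UTF-16 conversion involves, here is a standalone C++ sketch that does not use the SpiderMonkey API. The function name `utf8ToUtf16` and its bool/out-parameter interface are assumptions made for this illustration; the engine's own converter may differ in how it reports and handles malformed input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode UTF-8 bytes into UTF-16 code units.
// Returns false on malformed input; `out` is then only partially filled.
bool utf8ToUtf16(const std::string& in, std::u16string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;    // continuation bytes expected after lead
        std::uint32_t minCp = 0;  // smallest code point valid for this length
        if (lead < 0x80)                { cp = lead;        extra = 0; minCp = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; minCp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; minCp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; minCp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (i + extra >= in.size()) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += extra + 1;
    }
    return true;
}
```

Applied to the 21 bytes of the sample line, this produces 17 code units, with `c3 a1` collapsing to 0xe1 and `c3 b6` to 0xf6, as in the js36 hex dump.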

-j

Kent Williams

Mar 18, 2015, 3:03:21 PM

to dev-tech-...@lists.mozilla.org

Thanks for your help! It's ironic but it only really amounted to about
20 lines of code once I figured out the RIGHT 20 lines of code. Of
course I've been at it since Monday.

1. When making a new string in C++ to pass up to JavaScript, instead of
calling JS_NewStringCopyZ:
    // str/len: the UTF-8 bytes and their count; len is updated to the
    // UTF-16 length by the conversion.
    char16_t *ucbuf(
        JS::UTF8CharsToNewTwoByteCharsZ(cx, JS::UTF8Chars(str, len), &len).get());
    if (ucbuf) {
        JSString *rval = JS_NewUCStringCopyN(cx, ucbuf, len);
        free(ucbuf);
        return rval;
    }
    return nullptr; // null on failure.

2. When getting a string from JavaScript in a JSNative function, use
JS_EncodeStringToUTF8 instead of JS_EncodeString(cx, jsstr).

The 1.8.5 code seemed to store strings as UTF8 (if you told it to) but
if there were N-byte characters for N>1, the length of the string was
greater than the number of characters.
The new JS is more consistent -- the JavaScript world is all 2-byte
characters, and in JSNative C++ code, if you want to use UTF8 (which we
do) you have to convert from char16_t to UTF8 char * at the JS/C++ boundary.
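
The outbound direction of that boundary conversion (UTF-16 to UTF-8) can be sketched in standalone C++ as follows. This is an illustration only, not the SpiderMonkey implementation: `utf16ToUtf8` is a made-up name, and rejecting unpaired surrogates is a choice made for this sketch rather than a statement about what the engine does.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Encode UTF-16 code units as UTF-8 bytes.
// Returns false if the input contains an unpaired surrogate.
bool utf16ToUtf8(const std::u16string& in, std::string& out) {
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 >= in.size()) return false;
            const std::uint32_t low = in[i + 1];
            if (low < 0xDC00 || low > 0xDFFF) return false;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            ++i;  // the low surrogate has been consumed as well
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return false;  // low surrogate with no preceding high surrogate
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return true;
}
```

Encoding the 17-unit string from the example gives back the original 21 bytes, so text that goes from a UTF-8 file into the engine and back out again is unchanged.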
