JS Interpreters and UTF8


Daniel Friesen

May 14, 2009, 8:38:03 PM5/14/09
to serv...@googlegroups.com
I just noticed something today about differing handling of utf8 in some
JS interpreters.

I saved a file as UTF8 (the system locale on my system).
-- test.js

print("♥".length);

--
(If the e-mail screws it up for some reason, the content of the string
is the solid Unicode heart symbol. It's a single 3-byte character.)

Rhino `narwhal test.js` returns: 1
SpiderMonkey `js test.js` returns: 3

When I tried saving it as UTF-16, both of them suffered fatal errors (not
related to the Unicode character; removing it showed that they just had
issues reading the UTF-16, likely because they were expecting UTF-8).

So it seems at least two interpreters differ in what they do with
strings inside a file. Both read the UTF-8 bytes from the file, but only
Narwhal actually treats characters correctly.

Would anyone mind posting up some comparison with what v8 and
JavaScriptCore/SquirrelFish do?

I'm wondering if in MonkeyScript I should have .length and so on /fixed/
so that they treat characters as characters, rather than bytes as
characters. Or continue to consider bytes as characters, document it
that way, and implement things like charlength or mb??? methods to handle
Unicode properly.
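The two results above correspond to counting UTF-16 code units versus counting UTF-8 bytes. A small sketch of the difference (the helper name is made up for illustration, not an existing API):

```javascript
// utf8ByteLength: counts the UTF-8 bytes a JS string would encode to.
// A byte-oriented engine reports something like this as .length.
function utf8ByteLength(str) {
    var bytes = 0;
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        if (code < 0x80) bytes += 1;            // ASCII
        else if (code < 0x800) bytes += 2;      // two-byte sequence
        else if (code >= 0xD800 && code <= 0xDBFF) { bytes += 4; i++; } // surrogate pair
        else bytes += 3;                        // e.g. U+2665, the heart
    }
    return bytes;
}

var heart = "\u2665";
// heart.length is 1 (one UTF-16 code unit);
// utf8ByteLength(heart) is 3 (three UTF-8 bytes), matching the two outputs.
```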

--
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]


Tom Robinson

May 14, 2009, 9:43:31 PM5/14/09
to serv...@googlegroups.com
On May 14, 2009, at 5:38 PM, Daniel Friesen wrote:

>
> I just noticed something today about differing handling of utf8 in
> some
> JS interpreters.
>
> I saved a file as UTF8 (the system locale on my system).
> -- test.js
>
> print("♥".length);
>
> --
> (If the e-mail screws it up for some reason, the content of the string
> is the solid unicode heart symbol. It's a single 3byte character.
>
> Rhino `narwhal test.js` returns: 1
> SpiderMonkey `js test.js` returns: 3

Narwhal (on Rhino, at least) assumes JavaScript modules are UTF-8
encoded, and thus explicitly uses UTF-8 encoding when reading them.

The Rhino shell, however, uses the platform's default encoding by
default when loading code. They recently added a "-encoding" option to
the shell to override it. See here: http://groups.google.com/group/mozilla.dev.tech.js-engine.rhino/browse_thread/thread/b6c5db11c5584749

So I think it doesn't really matter what the interpreters do in their
included shells, since our implementations can do whatever we want. If
it were up to me, I'd specify all JavaScript modules must be stored as
UTF-8 (also "That ServerJS programs be stored in UTF-8" was Wes's 3rd
proposed "promise" in the "ServerJS Character Sets" thread).

> When I tried saving it as UTF16 both of them suffered fatal errors
> (not
> related to the unicode character, removing it showed that they just
> had
> issues reading the utf16, likely they were expecting utf8).
>
> So it seems at least two interpreters differ on what they do with
> strings inside of a file. Both read utf8 through the file, but only
> narwhal actually treats characters right.
>
> Would anyone mind posting up some comparison with what v8 and
> JavaScriptCore/SquirrelFish do?
>
> I'm wondering if in MonkeyScript I should have .length and so on /
> fixed/
> so that they treat characters as characters, rather than bytes as
> characters. Or continue to consider bytes as characters and document
> it
> that way implementing things like charlength or mb??? methods to
> handle
> unicode properly.

Yes, .length on a String should be the number of characters.
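One wrinkle worth noting: ECMAScript defines strings as sequences of UTF-16 code units, so .length reports 2 for a character outside the Basic Multilingual Plane even in an engine that decodes the source correctly. A hypothetical code-point counter, for illustration only:

```javascript
// codePointLength: counts code points rather than UTF-16 code units,
// by treating a surrogate pair as one character.
function codePointLength(str) {
    var count = 0;
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        if (code >= 0xD800 && code <= 0xDBFF) i++; // lead surrogate: skip its trail
        count++;
    }
    return count;
}

// "\u2665" (the heart) has length 1 either way; an astral character such
// as "\uD834\uDD1E" (U+1D11E) has .length 2 but code-point length 1.
```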

-Tom

Hannes Wallnoefer

May 15, 2009, 7:41:35 AM5/15/09
to serv...@googlegroups.com
2009/5/15 Daniel Friesen <nadir.s...@gmail.com>:

>
> I just noticed something today about differing handling of utf8 in some
> JS interpreters.
>
> I saved a file as UTF8 (the system locale on my system).
> -- test.js
>
> print("♥".length);
>
> --
> (If the e-mail screws it up for some reason, the content of the string
> is the solid unicode heart symbol. It's a single 3byte character.
>
> Rhino `narwhal test.js` returns: 1
> SpiderMonkey `js test.js` returns: 3

I can reproduce this. To me it looks like a bug in standalone
SpiderMonkey. The Firefox JS shell returns 1, and the same character
codes as Rhino.

Does any SpiderMonkey maven here have any idea what may cause this?
Maybe an odd config/build switch?

Hannes

Ondrej Zara

May 15, 2009, 8:00:07 AM5/15/09
to serv...@googlegroups.com
Would anyone mind posting up some comparison with what v8 and
JavaScriptCore/SquirrelFish do?


V8 reports (correctly, IMHO) 1.



Ondrej

 

Patrick Mueller

May 15, 2009, 8:11:15 AM5/15/09
to serv...@googlegroups.com
Via nitro_pie, JavaScriptCore prints 1.

On May 15, 2009, at 8:00 AM, Ondrej Zara wrote:

Would anyone mind posting up some comparison with what v8 and
JavaScriptCore/SquirrelFish do?

V8 reports (correctly, IMHO) 1.

Patrick Mueller - http://muellerware.org/

Wes Garland

May 15, 2009, 8:37:43 AM5/15/09
to serv...@googlegroups.com
Dan;

This is a bad test (or if it's a good test, you need to discuss your methodology further).

1. Did you have SpiderMonkey in UTF-8 C-strings mode?  It is not enabled by default.
2. If you were using the Mozilla File Object, you need to open the file as binary in order to get it to process the characters as UTF-8.
3. Determining interpreter behaviour from File I/O behaviour is wrong.  The File Object needs to handle the character set conversion properly.
4. There is no defined handling for any character set in JavaScript, although source code must be mappable onto UTF-16.

If you're interested in developing Unicode compliance tests around APIs conformant to Binary/B ByteStrings and File API Draft 4, I would be happy to make sure that they are valid and that GPSEE passes them. I just finished the character set-related methods and should have File I/O up and running soon.

Wes

--
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102

Wes Garland

May 15, 2009, 8:54:34 AM5/15/09
to serv...@googlegroups.com
I just re-read Dan's message and realized he's saving a *program file*, not using File I/O.  My comments still stand.

2009/5/15 Hannes Wallnoefer <han...@gmail.com>

Does any spidermonkey maven here have any ideas what may cause this?
Maybe an odd config/build switch?

Sure, this is easy.  IMHO, the SpiderMonkey answer is just as right as the Rhino et al. answers.  Why?  SpiderMonkey is interpreting the heart as three distinct characters, because it's operating in what I've previously described as "raw" mode.  It sees three bytes, knows nothing about UTF-8, and so stores them in a String.

The fact that JavaScript source code has no defined character encoding is really at fault here.  This is why I suggested last week that we define a default encoding of UTF-8 for ServerJS source files, although that suggestion was met with no support and plenty of detractors.  It appears that the other interpreters are assuming that JavaScript source is UTF-8, whereas SpiderMonkey is assuming that JavaScript source code matches the behaviour of C strings. Both are ASSUMING; there is no standard beyond 7 bits.

The reason the Firefox JS shell works is that its character encoding handling is something more sophisticated than "C strings".  (What shell were you using? Jesse Ruderman's? Or something built into the browser?)

Hannes, I haven't tested this, but you should be able to fix this by putting SpiderMonkey into C-strings-are-UTF-8 mode. I'm not sure when it changed -- probably at the 1.7/1.8 boundary, but it used to be a compiler #define and is now a runtime function, JS_CStringsAreUTF8(). That function affects how SpiderMonkey does multibyte to wide char conversion (mbstowcs) and vice versa... what do they call it, inflate/deflate in the source code, I think.  A read through JS_GetStringBytes() in jsstr.c (or .cpp for JSAPI 1.8.1) the other day also suggests that bleeding-edge SpiderMonkey knows about UTF-16 beyond 16 bits.

https://developer.mozilla.org/en/SpiderMonkey/JSAPI_Reference/JS_GetStringBytes

mrogers

May 15, 2009, 10:29:10 AM5/15/09
to serverjs
I've been following this group intently, but I don't comment often.
As a developer who is looking forward to having these new standards
available, I wanted to throw in $0.02.

Everything that was said makes sense, and certainly character encoding
is something to be considered carefully. But I think part of
developing a standard platform is providing a consistent environment
as well as a consistent API. If I'm deploying code that uses
hard-coded strings, the output of the length property should be
consistent across interpreters with all settings at their defaults. As
the SpiderMonkey example illustrates, we can't (and shouldn't) depend on
the different interpreters to ensure this consistency. I think that's
why some are arguing for making it a property of the ServerJS
standard. I vote for making UTF-8 standard and making it easy to
change if necessary. Note that I haven't really gone back and read up
on the reasons why people think this is a bad idea. But IMO it's
clearly the common case vs. the special case.

Here's an example of what I think is a plausible scenario. You have
your code deployed on a platform using Rhino and everything runs
fine. You're concerned about performance, so you decide to try another
platform backed by SpiderMonkey. You deploy your code and run your
tests, and half of them break due to encoding issues. Some would argue
that it would be fairly easy to determine why things were screwy and
make the appropriate config changes to SpiderMonkey. But I've learned
that it doesn't pay to try to anticipate the problems that someone
else will run into. You can't catch all edge cases with a standard.
So it's more important to be consistent, in API behavior as well as
environment.

I think this group is doing an awesome job, and I'm really excited
about the progress.

:Marco

On May 15, 8:54 am, Wes Garland <w...@page.ca> wrote:
> I just re-read Dan's message and realized he's saving a *program file*, not
> using File I/O.  My comments still stand.
>
> 2009/5/15 Hannes Wallnoefer <hann...@gmail.com>
>
> > Does any spidermonkey maven here have any ideas what may cause this?
> > Maybe an odd config/build switch?
>
> Sure, this is easy.  IMHO, the SpiderMonkey answer is just-as-right as the
> Rhino/etc answers.  Why?  SpiderMonkey is interpreting the heart as three
> distinct characters, because it's operating in what I've previously
> described as "raw" mode.  It sees three bytes, knows nothing about UTF-8,
> and so stores them in  a String.
>
> The fact that JavaScript source code has no defined character encoding is
> really at fault here.  This is why I suggested last week that we define a
> default encoding of UTF-8 for ServerJS source files, although that
> suggestion was met with no support and plenty of detractors.  It appears
> that the other interpreters are assuming that JavaScript source is UTF-8
> whereas SpiderMonkey is assuming that JavaScript source code matches the
> behaviour of C strings. Both are ASSUMING, there is no standard beyond 7
> bits.
>
> The reason the firefox JS shell works is because the character encoding is
> something more sophisticated than "C strings".  (What shell were you using?
> Jesse Ruderman's? Or something built into the browser?)
>
> Hannes, *I haven't tested this*, but you should be able to fix this by
> putting SpiderMonkey into C-Strings-are-UTF-8 mode. I'm not sure when it
> changed -- probably 1.7/1.8 boundary, but it used to be a compiler #define
> and is now a runtime function, JS_CStringsAreUTF8(). That function affects
> the behaviour of how SpiderMonkey does multibyte to wide char conversion
> (mbtwcs) and vice-versa... what do they call it, inflate/deflate in the
> source code, I think.  A read through JS_GetStringBytes() in jsstr.c (or
> .cpp for jsapi 1.8.1) the other also suggests that bleed-edge spidermonkey
> knows about UTF-16 beyond 16-bits..
>
> https://developer.mozilla.org/en/SpiderMonkey/JSAPI_Reference/JS_GetS...

Kevin Dangoor

May 15, 2009, 1:10:03 PM5/15/09
to serv...@googlegroups.com
On Fri, May 15, 2009 at 10:29 AM, mrogers <marco....@gmail.com> wrote:
> Everything that was said makes sense and certainly character encoding
> is something to be considered carefully.  But I think part of
> developing a standard platform is providing a consistent environment
> as well as a consistent API.  If I'm deploying code that uses hard-
> coded strings, the output of the length property should be consistent
> across interpreters with all settings being default.  As the
> Spidermonkey example illustrates, we can't (and shouldn't) depend on
> the different interpreters to ensure this consistency.  I think that's
> why some are arguing for making it a property of the ServerJS
> standard.  I vote for making UTF-8 standard and making it easy to
> change if necessary.

FWIW, Python has a mechanism to deal with the encoding of source
files. We could copy that.

I agree that we wouldn't want character encoding of files to hamper
module sharing.

Kevin


--
Kevin Dangoor

work: http://labs.mozilla.com/
email: k...@blazingthings.com
blog: http://www.BlueSkyOnMars.com

Robert Koberg

May 15, 2009, 1:51:51 PM5/15/09
to serv...@googlegroups.com

On May 15, 2009, at 1:10 PM, Kevin Dangoor wrote:

>
> On Fri, May 15, 2009 at 10:29 AM, mrogers <marco....@gmail.com>
> wrote:
>> Everything that was said makes sense and certainly character encoding
>> is something to be considered carefully. But I think part of
>> developing a standard platform is providing a consistent environment
>> as well as a consistent API. If I'm deploying code that uses hard-
>> coded strings, the output of the length property should be consistent
>> across interpreters with all settings being default. As the
>> Spidermonkey example illustrates, we can't (and shouldn't) depend on
>> the different interpreters to ensure this consistency. I think
>> that's
>> why some are arguing for making it a property of the ServerJS
>> standard. I vote for making UTF-8 standard and making it easy to
>> change if necessary.
>
> FWIW, Python has a mechanism to deal with the encoding of source
> files. We could copy that.
>
> I agree that we wouldn't want character encoding of files to hamper
> module sharing.

(not trying to start anything)

It might be beneficial to look at the spec for XSL (and their language
specific implementations - with Saxon being the default) since they
are in many ways similar. Cross language templating v. cross language
scripting:

http://www.w3.org/TR/xslt20/#parsing-and-serialization

perhaps more specifically:

http://www.w3.org/TR/xslt20/#unparsed-text

best,
-Rob

Hannes Wallnoefer

May 15, 2009, 2:44:40 PM5/15/09
to serv...@googlegroups.com
2009/5/15 Wes Garland <w...@page.ca>:

> I just re-read Dan's message and realized he's saving a *program file*, not
> using File I/O.  My comments still stand.
>
> 2009/5/15 Hannes Wallnoefer <han...@gmail.com>
>>
>> Does any spidermonkey maven here have any ideas what may cause this?
>> Maybe an odd config/build switch?
>
> Sure, this is easy.  IMHO, the SpiderMonkey answer is just-as-right as the
> Rhino/etc answers.  Why?  SpiderMonkey is interpreting the heart as three
> distinct characters, because it's operating in what I've previously
> described as "raw" mode.  It sees three bytes, knows nothing about UTF-8,
> and so stores them in  a String.

I'm not convinced. For example, standalone SpiderMonkey also gets the
following wrong, and the ES spec definitely says these are to be
interpreted as two-byte UTF-16 characters:

js> "\u2665"
e

I guess I'm lucky to be on the JVM. I wrote a simple wiki demo app
today, and Unicode worked out of the box:
http://hensotest.appspot.com/%E6%97%A5%E6%9C%AC%E8%AA%9E%E3%81%AE%E3%83%9A%E3%83%BC%E3%82%B8%E3%82%92%E6%A4%9C%E7%B4%A2/

> The fact that JavaScript source code has no defined character encoding is
> really at fault here.  This is why I suggested last week that we define a
> default encoding of UTF-8 for ServerJS source files, although that
> suggestion was met with no support and plenty of detractors.  It appears
> that the other interpreters are assuming that JavaScript source is UTF-8
> whereas SpiderMonkey is assuming that JavaScript source code matches the
> behaviour of C strings. Both are ASSUMING, there is no standard beyond 7
> bits.

I still think it's wrong to decree UTF-8 as encoding for all ServerJS
source code. If somebody starts hacking on some code in an editor,
that editor will use the default encoding and so should the JS engine
you run the code with. If you then run some code with an encoding
other than your system's default, the JS engine should provide
switches or settings to adapt the encoding.

> The reason the firefox JS shell works is because the character encoding is
> something more sophisticated than "C strings".  (What shell were you using?
> Jesse Ruderman's? Or something built into the browser?)

I tried with my own TraceMonkey build (a few days old), the Ubuntu
SpiderMonkey package, and the Firefox JS shell. The first two showed
the (IMO) wrong behaviour.

> Hannes, I haven't tested this, but you should be able to fix this by putting
> SpiderMonkey into C-Strings-are-UTF-8 mode. I'm not sure when it changed --
> probably 1.7/1.8 boundary, but it used to be a compiler #define and is now a
> runtime function, JS_CStringsAreUTF8(). That function affects the behaviour
> of how SpiderMonkey does multibyte to wide char conversion (mbtwcs) and
> vice-versa... what do they call it, inflate/deflate in the source code, I
> think.  A read through JS_GetStringBytes() in jsstr.c (or .cpp for jsapi
> 1.8.1) the other also suggests that bleed-edge spidermonkey knows about
> UTF-16 beyond 16-bits..
>
> https://developer.mozilla.org/en/SpiderMonkey/JSAPI_Reference/JS_GetStringBytes

Thanks, but I'm afraid I'm not proficient enough to try fixing this myself.

Hannes

> Wes
>

Jason Orendorff

May 15, 2009, 3:11:51 PM5/15/09
to serv...@googlegroups.com
On Fri, May 15, 2009 at 9:29 AM, mrogers <marco....@gmail.com> wrote:
> Everything that was said makes sense and certainly character encoding
> is something to be considered carefully.  But I think part of
> developing a standard platform is providing a consistent environment
> as well as a consistent API.  If I'm deploying code that uses hard-
> coded strings, the output of the length property should be consistent
> across interpreters with all settings being default.  As the
> Spidermonkey example illustrates, we can't (and shouldn't) depend on
> the different interpreters to ensure this consistency.  I think that's
> why some are arguing for making it a property of the ServerJS
> standard.  I vote for making UTF-8 standard and making it easy to
> change if necessary.

Well said.

The SpiderMonkey shell will probably retain its dumb "raw" default
encoding for backwards compatibility. ServerJS should establish UTF-8
as the default, to avoid pointless deployment difficulties.

-j

Kevin Dangoor

May 15, 2009, 3:14:36 PM5/15/09
to serv...@googlegroups.com
On Fri, May 15, 2009 at 3:11 PM, Jason Orendorff
<jason.o...@gmail.com> wrote:
> The SpiderMonkey shell will probably retain its dumb "raw" default
> encoding for backwards compatibility. ServerJS should establish UTF-8
> as the default, to avoid pointless deployment difficulties.

I agree that UTF-8 is a fine default encoding for source files. I
further think that Python's standard seems reasonable as a way to
specify an alternate encoding for a source file:

http://www.python.org/dev/peps/pep-0263/
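PEP 263 works by putting an ASCII-only "coding cookie" in a comment within the first two lines of the file, e.g. `# -*- coding: iso-8859-1 -*-`. A rough sketch of what an analogous check could look like for JS source (the function name and regex are illustrative only, not part of any spec):

```javascript
// detectCoding: look for a PEP-263-style cookie in the first two lines
// of a source string, falling back to UTF-8 as the proposed default.
// (A real implementation would also require the cookie to sit in a
// comment; this sketch just scans for the pattern.)
function detectCoding(source) {
    var lines = source.split("\n", 2);
    for (var i = 0; i < lines.length; i++) {
        var m = /coding[:=]\s*([-\w.]+)/.exec(lines[i]);
        if (m) return m[1];
    }
    return "UTF-8";
}

// detectCoding("// -*- coding: ISO-8859-1 -*-\nvar x;") yields "ISO-8859-1";
// a file with no cookie yields "UTF-8".
```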

Wes Garland

May 15, 2009, 3:32:51 PM5/15/09
to serv...@googlegroups.com
On Fri, May 15, 2009 at 2:44 PM, Hannes Wallnoefer <han...@gmail.com> wrote:
I'm not convinced. For example, spidermonkey standalone also gets the
following wrong, and the ES spec definitly says these are to be
interpreted as two-byte UTF-16 characters:

js> "\u2665"
e

Yes, but the ES specification does not discuss the behaviour of the REPL when printing an evaluated result. In this case, unless you have enabled JS_CStringsAreUTF8(), I would expect the evaluator to deflate the string to 8-bit chars by discarding the high byte and outputting the resultant string with printf().
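That deflation account matches the observed output exactly: discarding the high byte of U+2665 leaves 0x65, which is the letter "e". A one-liner to confirm the arithmetic:

```javascript
// U+2665 deflated to 8 bits: 0x2665 & 0xFF === 0x65, i.e. "e" --
// which is precisely what the standalone shell printed for "\u2665".
var code = "\u2665".charCodeAt(0);               // 0x2665
var deflated = String.fromCharCode(code & 0xFF); // "e"
```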

Again -- I haven't tested that; I'm just going from knowledge I believe to be correct.

I still think it's wrong to decree UTF-8 as encoding for all ServerJS
source code. If somebody starts hacking on some code in an editor,
that editor will use the default encoding and so should the JS engine
you run the code with. If you then run some code with an encoding
other than your system's default, the JS engine should provide
switches or settings to adapt the encoding.

There are really only three options, from what I can see:

1. You do not believe source code should be interoperable.
2. You believe the standard encoding should be UTF-8.
3. You believe the standard encoding should be 7-bit ASCII.

Which option do you prefer?

I tried with my own tracemonkey build (few days old), the Ubuntu
spidermonkey package, and the firefox JS shell. The first two showed
the (IMO) wrong behaviour.

I'm trying to figure out what you mean by the Firefox JS shell.  Do you mean you are typing JavaScript into the location bar of your browser?
 
Wes

Wes Garland

May 15, 2009, 3:33:34 PM5/15/09
to serv...@googlegroups.com
> FWIW, Python has a mechanism to deal with the encoding of source
> files. We could copy that.

How does that work?

Wes Garland

May 15, 2009, 3:40:20 PM5/15/09
to serv...@googlegroups.com
Kevin:

> http://www.python.org/dev/peps/pep-0263/

That implementation requires either changes to the JavaScript parser or a secondary parse of the top of the source file (but it's certainly doable, at least for GPSEE).   What it *doesn't* do is handle modules.

Maybe the first line of a source file -- programs or modules -- could begin with "/// serverjs charset utf-8" or something? Maybe we could go for a full MIME header, like gettext?

This way we could have modules or programs developed in non-utf-8 encodings which are automatically transcoded at load time?

Jason Orendorff

May 15, 2009, 3:48:50 PM5/15/09
to serv...@googlegroups.com
On Fri, May 15, 2009 at 1:44 PM, Hannes Wallnoefer <han...@gmail.com> wrote:
>> The fact that JavaScript source code has no defined character encoding is
>> really at fault here.  This is why I suggested last week that we define a
>> default encoding of UTF-8 for ServerJS source files, although that
>> suggestion was met with no support and plenty of detractors.  It appears
>> that the other interpreters are assuming that JavaScript source is UTF-8
>> whereas SpiderMonkey is assuming that JavaScript source code matches the
>> behaviour of C strings. Both are ASSUMING, there is no standard beyond 7
>> bits.
>
> I still think it's wrong to decree UTF-8 as encoding for all ServerJS
> source code. If somebody starts hacking on some code in an editor,
> that editor will use the default encoding and so should the JS engine
> you run the code with. If you then run some code with an encoding
> other than your system's default, the JS engine should provide
> switches or settings to adapt the encoding.

This is also a reasonable choice, but it's my opinion that UTF-8 is a
significantly better default than "the system's default encoding", for
many reasons.

UTF-8 is very easy to specify and unambiguous.

UTF-8 can represent every character.

UTF-8 makes code more portable. If you use the default encoding, then
once you remove a JS file from that context (the machine where it was
written), it's impossible to tell for sure what the file even means.

UTF-8 looks different enough from other encodings that you can usually
detect encoding mistakes. The same can't be said for all the variants
of Latin-1, for example. Writing code in Latin-1 is a bad idea;
writing code in some Eastern European variant of Latin-1 is worse.
It's good to detect this mistake as soon as possible.
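The detectability Jason describes follows from UTF-8's byte structure: lead bytes announce the sequence length, and every continuation byte must match the bit pattern 10xxxxxx. A rough validity check, for illustration only (real decoders do stricter range checks):

```javascript
// looksLikeUtf8: returns true if a byte array parses as UTF-8
// sequences. Arbitrary Latin-1 text usually fails this, which is why
// encoding mix-ups are detectable when UTF-8 is expected.
function looksLikeUtf8(bytes) {
    for (var i = 0; i < bytes.length; ) {
        var b = bytes[i];
        var n;                          // number of continuation bytes
        if (b < 0x80) n = 0;            // ASCII
        else if (b < 0xC2) n = -1;      // bare continuation / overlong lead
        else if (b < 0xE0) n = 1;       // 2-byte sequence
        else if (b < 0xF0) n = 2;       // 3-byte sequence
        else if (b < 0xF5) n = 3;       // 4-byte sequence
        else n = -1;                    // invalid lead byte
        if (n < 0) return false;
        for (var j = 1; j <= n; j++) {
            if ((bytes[i + j] & 0xC0) !== 0x80) return false;
        }
        i += n + 1;
    }
    return true;
}

// [0xE2, 0x99, 0xA5] is the heart in UTF-8 and passes; a lone 0xE9
// (Latin-1 "é") fails because no continuation bytes follow it.
```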

Converting UTF-8 to UTF-16, say, is actually easier to implement than
converting "the system's default encoding" to UTF-16. (I say this from
experience, though of course yours may vary.)

-j

Kris Kowal

May 15, 2009, 4:53:26 PM5/15/09
to serv...@googlegroups.com
I also support using a consistent encoding in all implementations, so
that source code is interoperable without transcoding from any machine
to any other machine on a network. I also think that all supported
encodings must be a US-ASCII superset, like UTF-8. UTF-8 is a
reasonable default because it accommodates the entire Unicode set.

Eventually, we could support something like PEP 263 to permit the
parser to switch from decoding one US-ASCII superset to another,
since the desired charset can be expressed with entirely US-ASCII
code points.

Kris Kowal

Daniel Friesen

May 16, 2009, 4:01:46 AM5/16/09
to serverjs
Using PEP 263 as a reference, and a number of comments on the list, I
drafted a sketch of what I may eventually support in MonkeyScript:
http://draft.monkeyscript.org/wiki/Encoding

I put it on my wiki rather than ServerJS since good portions of it are
fairly MonkeyScript oriented.

As a side note to:
// -*- coding: ISO-8859-1 -*-
Something interesting might be to expose the encoding-checking
algorithm to JS as a function (i.e., so that modules written so they
can also be used in the browser can have their "Content-Type:
text/javascript; charset=???" generated from the encoding header in
the JS file, though I might actually allow an encoding to be defined
in the package metadata for my loader). The // alternative was added
for "other interpreters", but "let the script still be
browser-executable" is another possibility.

In the meantime, I'm part of the "Load things as UTF-8 by default, but
leave room for a possible future definition for methods of specifying
alternative encodings" party.

Ash Berlin

May 16, 2009, 3:18:08 PM5/16/09
to serv...@googlegroups.com


+1 for default UTF-8 with method or plans to allow other encodings
