I saved a file as UTF8 (the system locale on my system).
-- test.js
print("♥".length);
--
(If the e-mail screws it up for some reason, the content of the string
is the solid Unicode heart symbol. It's a single character that is
three bytes in UTF-8.)
Rhino `narwhal test.js` returns: 1
SpiderMonkey `js test.js` returns: 3
When I tried saving it as UTF-16, both of them suffered fatal errors (not
related to the Unicode character; removing it showed that they just had
issues reading the UTF-16, likely because they were expecting UTF-8).
So it seems at least two interpreters differ on what they do with
strings inside of a file. Both read UTF-8 from the file, but only
Narwhal actually treats characters right.
Would anyone mind posting up some comparison with what v8 and
JavaScriptCore/SquirrelFish do?
I'm wondering if in MonkeyScript I should have .length and so on /fixed/
so that they treat characters as characters rather than bytes as
characters, or continue to consider bytes as characters and document it
that way, implementing things like charlength or mb??? methods to handle
Unicode properly.
--
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
>
> I just noticed something today about differing handling of utf8 in
> some
> JS interpreters.
>
> I saved a file as UTF8 (the system locale on my system).
> -- test.js
>
> print("♥".length);
>
> --
> (If the e-mail screws it up for some reason, the content of the string
> is the solid Unicode heart symbol. It's a single 3-byte character.)
>
> Rhino `narwhal test.js` returns: 1
> SpiderMonkey `js test.js` returns: 3
Narwhal (on Rhino, at least) assumes JavaScript modules are UTF-8
encoded, and thus explicitly uses UTF-8 encoding when reading them.
The Rhino shell, however, uses the platform's default encoding by
default when loading code. They recently added a "-encoding" option to
the shell to override it. See here: http://groups.google.com/group/mozilla.dev.tech.js-engine.rhino/browse_thread/thread/b6c5db11c5584749
So I think it doesn't really matter what the interpreters do in their
included shells, since our implementations can do whatever we want. If
it were up to me, I'd specify all JavaScript modules must be stored as
UTF-8 (also "That ServerJS programs be stored in UTF-8" was Wes's 3rd
proposed "promise" in the "ServerJS Character Sets" thread).
> When I tried saving it as UTF-16, both of them suffered fatal errors
> (not related to the Unicode character; removing it showed that they
> just had issues reading the UTF-16, likely they were expecting UTF-8).
>
> So it seems at least two interpreters differ on what they do with
> strings inside of a file. Both read UTF-8 from the file, but only
> Narwhal actually treats characters right.
>
> Would anyone mind posting up some comparison with what v8 and
> JavaScriptCore/SquirrelFish do?
>
> I'm wondering if in MonkeyScript I should have .length and so on
> /fixed/ so that they treat characters as characters, rather than
> bytes as characters. Or continue to consider bytes as characters and
> document it that way, implementing things like charlength or mb???
> methods to handle unicode properly.
Yes, .length on a String should be the number of characters.
-Tom
I can reproduce this. To me it looks like a bug in the standalone
Spidermonkey. The Firefox JS shell returns 1, and the same character
codes as Rhino.
Does any spidermonkey maven here have any ideas what may cause this?
Maybe an odd config/build switch?
Hannes
> Would anyone mind posting up some comparison with what v8 and
> JavaScriptCore/SquirrelFish do?
V8 reports (correctly, IMHO) 1.
> Does any spidermonkey maven here have any ideas what may cause this?
> Maybe an odd config/build switch?
FWIW, Python has a mechanism to deal with the encoding of source
files. We could copy that.
I agree that we wouldn't want character encoding of files to hamper
module sharing.
Kevin
--
Kevin Dangoor
work: http://labs.mozilla.com/
email: k...@blazingthings.com
blog: http://www.BlueSkyOnMars.com
>
> On Fri, May 15, 2009 at 10:29 AM, mrogers <marco....@gmail.com>
> wrote:
>> Everything that was said makes sense and certainly character encoding
>> is something to be considered carefully. But I think part of
>> developing a standard platform is providing a consistent environment
>> as well as a consistent API. If I'm deploying code that uses hard-
>> coded strings, the output of the length property should be consistent
>> across interpreters with all settings being default. As the
>> Spidermonkey example illustrates, we can't (and shouldn't) depend on
>> the different interpreters to ensure this consistency. I think
>> that's
>> why some are arguing for making it a property of the ServerJS
>> standard. I vote for making UTF-8 standard and making it easy to
>> change if necessary.
>
> FWIW, Python has a mechanism to deal with the encoding of source
> files. We could copy that.
>
> I agree that we wouldn't want character encoding of files to hamper
> module sharing.
(not trying to start anything)
It might be beneficial to look at the spec for XSL (and their language
specific implementations - with Saxon being the default) since they
are in many ways similar. Cross language templating v. cross language
scripting:
http://www.w3.org/TR/xslt20/#parsing-and-serialization
perhaps more specifically:
http://www.w3.org/TR/xslt20/#unparsed-text
best,
-Rob
I'm not convinced. For example, spidermonkey standalone also gets the
following wrong, and the ES spec definitely says these are to be
interpreted as two-byte UTF-16 characters:
js> "\u2665"
e
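For reference, the spec-conforming result for that escape is easy to check directly (a minimal sketch; a `\u` escape is six pure-ASCII characters in the source, so file encoding should not affect it at all):

```javascript
// "\u2665" is a Unicode escape sequence denoting the single UTF-16
// code unit U+2665, regardless of how the source file is encoded.
console.log("\u2665".length);        // 1
console.log("\u2665".charCodeAt(0)); // 9829 (0x2665)
```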
I guess I'm lucky to be on the JVM. I wrote a simple wiki demo app
today, Unicode worked out of the box:
http://hensotest.appspot.com/%E6%97%A5%E6%9C%AC%E8%AA%9E%E3%81%AE%E3%83%9A%E3%83%BC%E3%82%B8%E3%82%92%E6%A4%9C%E7%B4%A2/
> The fact that JavaScript source code has no defined character encoding is
> really at fault here. This is why I suggested last week that we define a
> default encoding of UTF-8 for ServerJS source files, although that
> suggestion was met with no support and plenty of detractors. It appears
> that the other interpreters are assuming that JavaScript source is UTF-8
> whereas SpiderMonkey is assuming that JavaScript source code matches the
> behaviour of C strings. Both are ASSUMING, there is no standard beyond 7
> bits.
I still think it's wrong to decree UTF-8 as encoding for all ServerJS
source code. If somebody starts hacking on some code in an editor,
that editor will use the default encoding and so should the JS engine
you run the code with. If you then run some code with an encoding
other than your system's default, the JS engine should provide
switches or settings to adapt the encoding.
> The reason the firefox JS shell works is because the character encoding is
> something more sophisticated than "C strings". (What shell were you using?
> Jesse Ruderman's? Or something built into the browser?)
I tried with my own tracemonkey build (few days old), the Ubuntu
spidermonkey package, and the firefox JS shell. The first two showed
the (IMO) wrong behaviour.
> Hannes, I haven't tested this, but you should be able to fix this by putting
> SpiderMonkey into C-Strings-are-UTF-8 mode. I'm not sure when it changed --
> probably 1.7/1.8 boundary, but it used to be a compiler #define and is now a
> runtime function, JS_CStringsAreUTF8(). That function affects the behaviour
> of how SpiderMonkey does multibyte to wide char conversion (mbtwcs) and
> vice-versa... what do they call it, inflate/deflate in the source code, I
> think. A read through JS_GetStringBytes() in jsstr.c (or .cpp for jsapi
> 1.8.1) the other also suggests that bleed-edge spidermonkey knows about
> UTF-16 beyond 16-bits..
>
> https://developer.mozilla.org/en/SpiderMonkey/JSAPI_Reference/JS_GetStringBytes
Thanks, but I'm afraid I'm not proficient enough to try fixing this myself.
Hannes
> Wes
>
Well said.
The SpiderMonkey shell will probably retain its dumb "raw" default
encoding for backwards compatibility. ServerJS should establish UTF-8
as the default, to avoid pointless deployment difficulties.
-j
I agree that UTF-8 is a fine default encoding for source files. I also
further think that Python's standard seems reasonable as a way to
specify an alternate encoding for a source file:
http://www.python.org/dev/peps/pep-0263/
> I'm not convinced. For example, spidermonkey standalone also gets the
> following wrong, and the ES spec definitely says these are to be
> interpreted as two-byte UTF-16 characters:
>
> js> "\u2665"
> e
>
> I still think it's wrong to decree UTF-8 as encoding for all ServerJS
> source code. If somebody starts hacking on some code in an editor,
> that editor will use the default encoding and so should the JS engine
> you run the code with. If you then run some code with an encoding
> other than your system's default, the JS engine should provide
> switches or settings to adapt the encoding.
>
> I tried with my own tracemonkey build (few days old), the Ubuntu
> spidermonkey package, and the firefox JS shell. The first two showed
> the (IMO) wrong behaviour.
This is also a reasonable choice, but it's my opinion that UTF-8 is a
significantly better default than "the system's default encoding", for
many reasons.
UTF-8 is very easy to specify and unambiguous.
UTF-8 can represent every character.
UTF-8 makes code more portable. If you use the default encoding, then
once you remove a JS file from that context (the machine where it was
written), it's impossible to tell for sure what the file even means.
UTF-8 looks different enough from other encodings that you can usually
detect encoding mistakes. The same can't be said for all the variants
of Latin-1, for example. Writing code in Latin-1 is a bad idea;
writing code in some Eastern European variant of Latin-1 is worse.
It's good to detect this mistake as soon as possible.
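That detectability comes from UTF-8's self-describing byte patterns. A rough sketch of such a check (using `TextDecoder` in fatal mode, a modern API that obviously wasn't available to these shells at the time):

```javascript
// Mislabeled single-byte encodings are usually detectable as UTF-8:
// a lone Latin-1 "é" byte (0xE9) is not a legal UTF-8 sequence.
function looksLikeUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(Uint8Array.from(bytes));
    return true;
  } catch (e) {
    return false; // fatal mode throws on any invalid sequence
  }
}
console.log(looksLikeUtf8([0xe2, 0x99, 0xa5])); // true: valid UTF-8 for U+2665
console.log(looksLikeUtf8([0xe9]));             // false: bare Latin-1 byte
```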
Converting UTF-8 to UTF-16, say, is actually easier to implement than
converting "the system's default encoding" to UTF-16. (I say this from
experience, though of course yours may vary.)
-j
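As an illustration of that last claim, a minimal UTF-8 to UTF-16 converter really is short, because the byte patterns encode their own lengths. This is only a sketch: it handles 1- to 3-byte sequences (the BMP), with no validation and no surrogate-pair output for 4-byte sequences.

```javascript
// Decode an array of UTF-8 bytes into a JS (UTF-16) string, BMP only.
function utf8ToString(bytes) {
  let out = "";
  for (let i = 0; i < bytes.length; ) {
    const b = bytes[i];
    if (b < 0x80) {        // 0xxxxxxx: ASCII, 1 byte
      out += String.fromCharCode(b);
      i += 1;
    } else if (b < 0xe0) { // 110xxxxx 10xxxxxx: 2 bytes
      out += String.fromCharCode(((b & 0x1f) << 6) | (bytes[i + 1] & 0x3f));
      i += 2;
    } else {               // 1110xxxx 10xxxxxx 10xxxxxx: 3 bytes
      out += String.fromCharCode(((b & 0x0f) << 12) |
                                 ((bytes[i + 1] & 0x3f) << 6) |
                                 (bytes[i + 2] & 0x3f));
      i += 3;
    }
  }
  return out;
}
console.log(utf8ToString([0xe2, 0x99, 0xa5])); // the three UTF-8 bytes of U+2665
```

Decoding "the system's default encoding" instead means shipping (or linking against) conversion tables for every legacy charset.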
Eventually, we could support something like PEP-0263 to permit the
parser to switch from decoding one US-ASCII superset to another,
since the desired charset can be expressed entirely in US-ASCII
code points.
Kris Kowal
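A hypothetical sketch of such a declaration sniffer (the `coding:` comment convention is borrowed from PEP 263; nothing here is an agreed ServerJS convention):

```javascript
// Read a PEP-263-style coding declaration from a module's first line.
// The declaration itself uses only US-ASCII code points, so it can be
// parsed before choosing a decoder for the rest of the file.
function sniffEncoding(firstLine) {
  const m = /coding[:=]\s*([-\w.]+)/.exec(firstLine);
  return m ? m[1] : "utf-8"; // fall back to the proposed UTF-8 default
}
console.log(sniffEncoding("// -*- coding: iso-8859-1 -*-")); // "iso-8859-1"
console.log(sniffEncoding("var x = 1;"));                    // "utf-8"
```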
+1 for defaulting to UTF-8, with a mechanism or plans to allow other encodings