Is this undefined behavior in JavaScript, a bug, or am I doing
something weird?
Thanks.
If I put
print("á".charCodeAt(0))
into a file "x.js", save it with UTF-8 encoding, and run it with Rhino
using
java -jar js.jar x.js
it prints 8730. It turns out "á" is stored in the file as C3 A1, which is
indeed UTF-8 for "á". However, java.lang.System.getProperty("file.encoding")
returns "MacRoman", and C3 in MacRoman maps to U+221A "SQUARE ROOT"
(decimal 8730), so the two bytes are read as two separate characters and
charCodeAt(0) returns the first one. The same happens when typing the
expression directly into the console.
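
To make the mismatch concrete, here is a minimal sketch that decodes the same two bytes both ways. It is not Rhino code: the function names and the MAC_ROMAN_SAMPLE table are hypothetical, the table covers only the two byte values mentioned above, and the UTF-8 decoder handles only 1- and 2-byte sequences.

```javascript
// Partial MacRoman table: ONLY the two byte values from this thread.
// (Hypothetical helper data; a real decoder needs every high-byte entry.)
var MAC_ROMAN_SAMPLE = {
  0xC3: 0x221A, // SQUARE ROOT
  0xA1: 0x00B0  // DEGREE SIGN
};

// Decode bytes as MacRoman: one byte always yields exactly one character.
function decodeMacRomanSample(bytes) {
  var codes = [];
  for (var i = 0; i < bytes.length; i++) {
    var b = bytes[i];
    if (b < 0x80) {
      codes.push(b); // the ASCII range is the same in MacRoman
    } else if (MAC_ROMAN_SAMPLE.hasOwnProperty(b)) {
      codes.push(MAC_ROMAN_SAMPLE[b]);
    } else {
      throw new Error("byte 0x" + b.toString(16) + " is not in the sample table");
    }
  }
  return codes;
}

// Decode bytes as UTF-8, handling only 1- and 2-byte sequences.
function decodeUtf8Short(bytes) {
  var codes = [];
  for (var i = 0; i < bytes.length; i++) {
    var b = bytes[i];
    if (b < 0x80) {
      codes.push(b);
    } else if ((b & 0xE0) === 0xC0 && i + 1 < bytes.length &&
               (bytes[i + 1] & 0xC0) === 0x80) {
      // 110xxxxx 10yyyyyy -> xxxxxyyyyyy
      codes.push(((b & 0x1F) << 6) | (bytes[i + 1] & 0x3F));
      i++; // consume the continuation byte
    } else {
      throw new Error("sequence not handled by this sketch");
    }
  }
  return codes;
}

var fileBytes = [0xC3, 0xA1];                     // "á" as saved in UTF-8
var asUtf8 = decodeUtf8Short(fileBytes);          // one code unit
var asMacRoman = decodeMacRomanSample(fileBytes); // two code units
```

Read as UTF-8, the bytes give the single value 225; read as MacRoman, they give two values, and the first one is the 8730 reported above.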
So, there's a discrepancy between character encodings: the console on
Mac OS X apparently feeds the characters through System.in as a UTF-8
encoded byte stream, but the Rhino shell reads them as MacRoman, as
that's the default Java encoding in the JRE (the value of the
"file.encoding" system property). Taken at face value, this is actually
a bug in Java; if the console is UTF-8 based, the JRE should detect that
and set "file.encoding" to UTF-8.
We could work around it if the Rhino shell had an explicit command-line
encoding option, e.g. if you could specify "-c utf-8" -- that'd
solve it.
Actually, I believe I'll just write code to solve this in a way that
conforms to RFC 4329.
Attila.
--
home: http://www.szegedi.org
weblog: http://constc.blogspot.com
<https://bugzilla.mozilla.org/show_bug.cgi?id=399347#c3>
A Rhino JAR built from the current CVS HEAD now correctly prints 225
when launched with "-enc utf-8":
MacBook-Ati:rhino aszegedi$ java -jar build/rhino1_7R2pre/js.jar -enc utf-8
Rhino 1.7 release 2 PRERELEASE 2008 10 18
js> print("á".charCodeAt(0))
225
js>
Actually, if your file is UTF-8, UTF-16, or UTF-32 encoded and has a
byte order mark at the beginning of the file, it'll be correctly
decoded even without the "-enc" parameter.
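
For illustration, byte order mark detection of this kind can be sketched as follows. This is an assumption about the general technique, not the code that went into Rhino: sniffBom is a hypothetical name, and what to do when no BOM is found (use "-enc", the platform default, or something else) is left to the caller.

```javascript
// Return the encoding implied by a leading byte order mark, or null if
// the byte array does not start with one. Hypothetical helper, not a Rhino API.
function sniffBom(bytes) {
  function startsWith(signature) {
    if (bytes.length < signature.length) return false;
    for (var i = 0; i < signature.length; i++) {
      // "& 0xFF" also accepts signed byte values (-128..127)
      if ((bytes[i] & 0xFF) !== signature[i]) return false;
    }
    return true;
  }
  // UTF-32LE is tested before UTF-16LE because both start with FF FE.
  if (startsWith([0xFF, 0xFE, 0x00, 0x00])) return "UTF-32LE";
  if (startsWith([0x00, 0x00, 0xFE, 0xFF])) return "UTF-32BE";
  if (startsWith([0xEF, 0xBB, 0xBF])) return "UTF-8";
  if (startsWith([0xFF, 0xFE])) return "UTF-16LE";
  if (startsWith([0xFE, 0xFF])) return "UTF-16BE";
  return null;
}
```

For example, sniffBom([0xEF, 0xBB, 0xBF, 0xC3, 0xA1]) returns "UTF-8", while sniffBom([0xC3, 0xA1]) returns null. Note that a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE by this ordering.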
Attila.
Also, jsc and the shell no longer choke on a BOM either, so that also
fixes bug 399347.
Attila.