
unicode whitespace problem


Eric Suen

Sep 5, 2006, 12:14:18 PM
Hi,

Can anyone explain why:

var i\u2000=1 //valid
var j \u2000 = 1 //invalid

Is this defined by the ECMA spec, or is it a bug in SpiderMonkey?

Thanks

Eric


Martin Honnen

Sep 5, 2006, 2:50:15 PM
Eric Suen wrote:


> Can anyone explain why:
>
> var i\u2000=1 //valid
> var j \u2000 = 1 //invalid
>
> Is this defined by the ECMA spec, or is it a bug in SpiderMonkey?


ECMAScript edition 3 says this:

"Unicode escape sequences are also permitted in identifiers, where they
contribute a single character to the identifier, as computed by the CV of
the UnicodeEscapeSequence (see section 7.8.4). The \ preceding the
UnicodeEscapeSequence does not contribute a character to the identifier. A
UnicodeEscapeSequence cannot be used to put a character into an identifier
that would otherwise be illegal. In other words, if a \ UnicodeEscapeSequence
sequence were replaced by its UnicodeEscapeSequence's CV, the result must
still be a valid Identifier that has the exact same sequence of characters
as the original Identifier."

and then has this grammar:

IdentifierName ::
    IdentifierStart
    IdentifierName IdentifierPart

IdentifierStart ::
    UnicodeLetter
    $
    _
    \ UnicodeEscapeSequence

IdentifierPart ::
    IdentifierStart
    UnicodeCombiningMark
    UnicodeDigit
    UnicodeConnectorPunctuation
    \ UnicodeEscapeSequence

UnicodeLetter
    any character in the Unicode categories “Uppercase letter (Lu)”,
    “Lowercase letter (Ll)”, “Titlecase letter (Lt)”, “Modifier letter (Lm)”,
    “Other letter (Lo)”, or “Letter number (Nl)”

UnicodeCombiningMark
    any character in the Unicode categories “Non-spacing mark (Mn)” or
    “Combining spacing mark (Mc)”

UnicodeDigit
    any character in the Unicode category “Decimal number (Nd)”

UnicodeConnectorPunctuation
    any character in the Unicode category “Connector punctuation (Pc)”


I think both the normal ASCII space \u0020 and that \u2000 space you are
using fall into the Unicode category "Space separator (Zs)", so they should
not be allowed in an identifier if I read that spec correctly. But I have
never looked into that part before.
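
For illustration, a minimal sketch of what that rule implies (the shells'
print function is assumed, and the variable names are made up):

var ok\u0041 = 1;      // \u0041 is 'A' (category Lu), so the identifier is "okA"
// var bad\u0020 = 1;  // \u0020 is a space (category Zs), so this should be a SyntaxError
print(okA);            // prints 1 in a conforming engine: okA and ok\u0041 are the same identifier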

Neither Rhino

Rhino 1.6 release 2 2005 09 19
js> var a\u0020 = 1;
js> for (var propertyName in this) { print("'" + propertyName + "'"); }
'a '
'propertyName'

nor Spidermonkey
<http://home.arcor.de/martin.honnen/mozillaBugs/ecmascript/identifierEscapeTest1.html>
in Firefox 1.5 seems to care.

I do not get any script syntax error with Opera 9 either for that test URL.

IE 6 gives two syntax errors "invalid character" (my translation of the
German error message I get here).
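
A quick way to probe an engine at run time (just a sketch, assuming the
shell's print function and that a rejecting engine reports a SyntaxError)
is to hide the declaration in eval so the escape reaches the parser:

var accepted;
try {
  eval("var v\\u0020 = 1;");  // double backslash so eval sees the \u escape
  accepted = true;
} catch (e) {
  accepted = false;           // a conforming engine should end up here
}
print("escaped space in identifier accepted: " + accepted);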

--

Martin Honnen
http://JavaScript.FAQTs.com/

Martin Honnen

Sep 6, 2006, 12:34:39 PM
Martin Honnen wrote:


> I think both the normal ASCII space \u0020 and that \u2000 space you are
> using fall into the Unicode category space separator so they should not
> be allowed in an identifier if I read that spec correctly.

A test case like the following

var v\u0020 = 1;

for (var propertyName in this) {
  print("propertyName: '" + propertyName + "'; length: " +
    propertyName.length);
}

shows that Rhino takes that v\u0020 as the identifier:

Rhino 1.6 release 3 2006 07 24
js> load('mozillaBugs/javascript/identifierEscapeTest1.js')
propertyName: 'propertyName'; length: 12
propertyName: 'v '; length: 2


Spidermonkey (trunk shell), however, creates a variable with the name 'v'
and ignores the \u0020:

js> load('mozillaBugs/javascript/identifierEscapeTest1.js')
propertyName: 'v'; length: 1
propertyName: 'propertyName'; length: 12


I don't think that either behaviour is correct.

Taking the \u0020 as part of the identifier, as Rhino does, is not allowed,
as ECMAScript edition 3 section 7.6 says: "A UnicodeEscapeSequence cannot be
used to put a character into an identifier that would otherwise be illegal."

On the other hand, section 6 restricts Unicode escape sequences to
string literals, regular expression literals and identifiers:
"In string literals, regular expression literals and identifiers, any
character (code point) may also be expressed as a
Unicode escape sequence consisting of six characters, namely \u plus
four hexadecimal digits."

So taking 'v' as an identifier token, as Spidermonkey seems to do, and
ignoring the \u0020 as white space seems wrong too.

Eric Suen

Sep 7, 2006, 12:15:05 PM
Here is the code:

i \u002b= 1; //valid
i \u002b\u003d 1; //invalid

In Java, both statements are valid; according to the ECMA spec, both of
them are invalid. So what exactly is the rule for Unicode escapes in
SpiderMonkey?

Eric


Stanimir Stamenkov

Sep 7, 2006, 12:28:21 PM
/Eric Suen/:

> i \u002b= 1; //valid
> i \u002b\u003d 1; //invalid
>
> In Java, both statements are valid; according to the ECMA spec,
> both of them are invalid

The Java Language Specification states that Unicode escapes [1] are
recognized as the first step of processing, so both statements are
ultimately translated to:

i += 1;
i += 1;

I don't know the details of the ECMAScript specification, but Martin has
already pointed out that it "restricts Unicode escape sequences to string
literals, regular expression literals and identifiers". So the statements
you've given contain lexical/syntax errors at the least.
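
To make the contrast concrete, here is a small sketch (names are made up)
of where ECMAScript, by that section 6 wording, does and does not allow \u
escapes; a conforming engine should reject the commented-out line as a
syntax error:

var s = "a\u002bb";   // allowed: escape inside a string literal, s is "a+b"
var \u0069tem = 1;    // allowed: escape contributes 'i' to the identifier "item"
// item \u002b= 1;    // not allowed: an escape cannot spell out the '+' of an operator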

[1]
http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.3

--
Stanimir

Martin Honnen

Sep 7, 2006, 1:02:10 PM
Eric Suen wrote:

I can't answer that; if you want a definitive answer, consider either
CCing Brendan Eich or filing a bug on bugzilla.mozilla.org against product
Core, component JavaScript Engine.

Martin Honnen

Sep 10, 2006, 11:47:11 AM
Martin Honnen wrote:


> A test case alike
>
> var v\u0020 = 1;
> for (var propertyName in this) {
> print("propertyName: '" + propertyName + "'; length: " +
> propertyName.length);
> }
>
> shows that Rhino takes that v\u0020 as the identifier:
>
> Rhino 1.6 release 3 2006 07 24
> js> load('mozillaBugs/javascript/identifierEscapeTest1.js')
> propertyName: 'propertyName'; length: 12
> propertyName: 'v '; length: 2
>
>
> Spidermonkey (trunk shell) however creates a variable with the name 'v'
> and ignores the \u0020:
>
> js> load('mozillaBugs/javascript/identifierEscapeTest1.js')
> propertyName: 'v'; length: 1
> propertyName: 'propertyName'; length: 12
>
>
> I don't think that either behaviour is correct.


I have filed <https://bugzilla.mozilla.org/show_bug.cgi?id=352042> on
Rhino and <https://bugzilla.mozilla.org/show_bug.cgi?id=352044> on
Spidermonkey.

Michael Daeumling

Sep 15, 2006, 12:43:46 AM
Ecma-262 is clear about which letters are Unicode identifier letters. These
are the Unicode groups:

UnicodeLetter
    any character in the Unicode categories "Uppercase letter (Lu)",
    "Lowercase letter (Ll)", "Titlecase letter (Lt)", "Modifier letter (Lm)",
    "Other letter (Lo)", or "Letter number (Nl)"

UnicodeCombiningMark
    any character in the Unicode categories "Non-spacing mark (Mn)" or
    "Combining spacing mark (Mc)"

UnicodeDigit
    any character in the Unicode category "Decimal number (Nd)"

UnicodeConnectorPunctuation
    any character in the Unicode category "Connector punctuation (Pc)"

AND: a UnicodeEscapeSequence cannot be used to put a character into an
identifier that would otherwise be illegal.

The parser simply converts Unicode escapes into the respective characters,
so if it sees a whitespace or, as in the above examples, a plus sign, it is
treated as such. A Unicode escape inside a symbol does not make it part of
the symbol; it is merely another character in the input stream.

All JavaScript implementations that act differently do not conform to the
spec, IMHO.

Michael

--
Michael

"Martin Honnen" <maho...@yahoo.de> wrote in message
news:svWdnVO9YocfrpnY...@mozilla.org...

Martin Honnen

Sep 15, 2006, 8:25:29 AM
Michael Daeumling wrote:


> A UnicodeEscapeSequence
>
> cannot be used to put a character into an identifier that would otherwise be
> illegal.
>
> The parser simply converts Unicode escapes into the respective characters,
> so if it sees a whitespace or, like in above examples, a Plus sign, it is
> treated as such. A Unicode escape inside a symbol does not make it part of
> the symbol, but it is merely another character in the input stream.

Well, what about section 6 of ECMAScript edition 3, which restricts Unicode
escape sequences to string literals, regular expression literals and
identifiers:

"In string literals, regular expression literals and identifiers, any
character (code point) may also be expressed as a Unicode escape sequence
consisting of six characters, namely \u plus four hexadecimal digits."

That means that you cannot have Unicode escape sequences anywhere else in
the input stream. Thus when e.g.

var v\u0020 = 1;

is tokenized, it seems Rhino takes \u0020 as part of the identifier while
Spidermonkey ignores the escaped space or drops it. I agree that the space
does not belong in the identifier, but given the above quotation restricting
the use of escape sequences to string literals, regular expression literals
and identifiers, is

var v\u0020

then not a syntax error?
