JavaScript Functions

Bruce A. Julseth

Dec 26, 2008, 9:37:41 PM

I've been developing in PHP and now want to learn the client side with
JavaScript.

One of the first problems I have come across is finding out which string
functions are available. Right now I'm looking for a "trim" function. Is
there one? If not, how can I do it?

Also, when my browser (IE, Firefox, Safari, Opera) finds a syntax problem,
it just plain quits executing with no error message. Is there a "switch" I
need to turn on in my browsers?

Thank you.

L.@canberra Trevor Lawrence

Dec 26, 2008, 11:42:18 PM

"Bruce A. Julseth" <julebj...@bellsouth.net> wrote in message
news:rqg5l.13290$n_5....@bignews7.bellsouth.net...

Hmm, an interesting question.
I had a look at http://www.w3schools.com/jsref/jsref_obj_string.asp
and there does not appear to be a trim function.

A nice long-winded way to left trim is to search for the first occurrence of
a non-blank character and take the substring from that point, e.g.
<script type="text/javascript">
// x defaults to str.length so that an all-blank string yields ""
var str = " Hello world!", i, x = str.length;
for (i = 0; i < str.length; i++) {
  if (str.substr(i, 1) != " ") { x = i; break; }
}
document.write(str.substr(x));
</script>

It would be similar for a right trim, using a reverse search, but I am
having trouble doing it

In any case, the experts here will no doubt find a better way to do it

Re syntax errors,
In IE7 try clicking the error icon on the bottom left. It doesn't always
help a great deal though.
Firefox has a plug-in named Firebug. Look for it on the FF site.
--
Trevor Lawrence
Canberra
Web Site http://trevorl.mvps.org

SAM

Dec 27, 2008, 12:02:17 AM

On 12/27/08 3:37 AM, Bruce A. Julseth wrote:

> I've been developing in PHP and now want to learn the client side with
> JavaScript..
>
> One of the first problems I have come across is what string functions are
> available. Right now I'm looking for a "trim" funciton. Is the one? If not,
> how can I do it?

Regular expressions?

// remove blank characters at the beginning and end

function trim(strg) {
return strg.replace(/^\s+|\s+$/g, '');
}

<https://developer.mozilla.org/en/Core_JavaScript_1.5_Guide/Creating_a_Regular_Expression>

> Also, when my browser (IE, FireFox, Safari, Opera) finds a syntext problem,
> it just plain quits executing with no error message. Is there a "switch" I
> need to turn on in my browers?

Firefox: menu Tools / Error Console
Safari: Preferences / Advanced / [X] Show Develop menu in menu bar
then: menu Develop / Show Error Console

--
sm

SAM

Dec 27, 2008, 12:06:39 AM

On 12/27/08 6:02 AM, SAM wrote:

>
> // delete/suppress blank characters in beginning and end
>
> function trim(strg) {
> return strg.replace(/^\s+|\s+$/g, '');
> }

function trimLeft(strg) {
return strg.replace(/^\s+/, '');
}

function trimRight(strg) {
return strg.replace(/\s+$/, '');
}
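
A quick usage sketch for the three functions above. The sample string and the `results` array are made up for illustration; the brackets only make any remaining whitespace visible.

```javascript
// The three helpers exactly as posted above.
function trim(strg) {
  return strg.replace(/^\s+|\s+$/g, '');
}
function trimLeft(strg) {
  return strg.replace(/^\s+/, '');
}
function trimRight(strg) {
  return strg.replace(/\s+$/, '');
}

var sample = '  \t Hello world! \n';   // hypothetical test string

var results = [
  '[' + trim(sample) + ']',       // [Hello world!]
  '[' + trimLeft(sample) + ']',   // leading blanks gone, trailing " \n" kept
  '[' + trimRight(sample) + ']'   // trailing blanks gone, leading "  \t " kept
];
```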

--
sm

RobG

Dec 27, 2008, 2:52:00 AM

"Bruce A. Julseth" <julebj...@bellsouth.net> wrote:
> I've been developing in PHP and now want to learn the client side with
>
> JavaScript..
>
> One of the first problems I have come across is what string functions
> are
> available.

The authoritative reference for properties and methods of built-in
objects (such as String) is the ECMA-262 specification.

> Right now I'm looking for a "trim" funciton. Is the one?

Not as a built-in function.

> If not,
> how can I do it?

There is one in the FAQ, which also has links to useful resources.

--
Rob

Gregor Kofler

Dec 27, 2008, 8:45:10 AM

Bruce A. Julseth wrote:

> I've been developing in PHP and now want to learn the client side with
> JavaScript..
>
> One of the first problems I have come across is what string functions are
> available. Right now I'm looking for a "trim" funciton. Is the one? If not,
> how can I do it?

No. Besides it would be a method of the string (prototype) object.

function trim(yourString) {
return yourString.replace(/^\s+\s+$/, "");
}

> Also, when my browser (IE, FireFox, Safari, Opera) finds a syntext problem,
> it just plain quits executing with no error message. Is there a "switch" I
> need to turn on in my browers?

Do yourself a favour and get proper add-ons (or activate them). FF has
Firebug, Opera has Dragonfly (Extras->(whatever that menu entry is
called in English)->Developer Tools), Safari has somewhat less capable
but still sufficient tools (activated somewhere in the options dialog;
being on Linux I can't test that right now).

Gregor

Thomas 'PointedEars' Lahn

Dec 27, 2008, 9:18:12 AM

Gregor Kofler wrote:
> Bruce A. Julseth meinte:

>> One of the first problems I have come across is what string functions are
>> available. Right now I'm looking for a "trim" funciton. Is the one? If not,
>> how can I do it?
>
> No. Besides it would be a method of the string (prototype) object.
>
> function trim(yourString) {
> return yourString.replace(/^\s+\s+$/, "");
> }

Apparently you forgot a | between the two \s+, and the global flag. See the
FAQ entry.
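
To see what the two omissions do in practice, here is a small sketch. The variable names and the input string are made up for the demonstration.

```javascript
var s = '  abc  ';                      // hypothetical input

// As first posted: without "|" the pattern reads "start, whitespace, more
// whitespace, end", so it only matches strings made up entirely of whitespace.
var broken = s.replace(/^\s+\s+$/, '');         // unchanged: '  abc  '

// With "|" but without the g flag, replace() stops after the first match,
// so only the leading run is removed.
var noGlobal = s.replace(/^\s+|\s+$/, '');      // 'abc  '

// With both corrections, leading and trailing runs are removed.
var fixed = s.replace(/^\s+|\s+$/g, '');        // 'abc'
```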

>> Also, when my browser (IE, FireFox, Safari, Opera) finds a syntext problem,
>> it just plain quits executing with no error message. Is there a "switch" I
>> need to turn on in my browers?
>

> Do yourself a favour and get proper add-ons (or activate) them. [...]

> Opera has Dragonfly (Extras->(whatever that menu entry is called in

> English)->Developer Tools), [...]

Speaking of which, does anyone know what I can do about always getting a
"Select a runtime" message when entering, say, 1 in the Command Line pane of
Dragonfly's Script tab in Opera/9.63 (X11; Linux i686; U; en) Presto/2.1.1?
So far I have found nothing relevant in the Settings (button with tools
icon, lower right corner of the Dragonfly pane).

TIA

PointedEars

Thomas 'PointedEars' Lahn

Dec 27, 2008, 9:30:15 AM

Thomas 'PointedEars' Lahn wrote:
> Gregor Kofler wrote:
>> Bruce A. Julseth meinte:

>>> Also, when my browser (IE, FireFox, Safari, Opera) finds a syntext problem,
>>> it just plain quits executing with no error message. Is there a "switch" I
>>> need to turn on in my browers?
>> Do yourself a favour and get proper add-ons (or activate) them. [...]
>> Opera has Dragonfly (Extras->(whatever that menu entry is called in
>> English)->Developer Tools), [...]
>
> Speaking of which, does anyone know what I can do about always getting a
> "Select a runtime" message when entering, say, 1 in the Command Line pane of
> Dragonfly's Script tab in Opera/9.63 (X11; Linux i686; U; en) Presto/2.1.1?
> So far I have found nothing relevant in the Settings (button with tools
> icon, lower right corner of the Dragonfly pane).

I think I've got it. For Scripts/Command Line to work properly, you have to
select a window/tab as global execution context under Scripts/Scripts first;
unlike Firebug, with Dragonfly the current tab is not automatically
selected. (I'd rather they changed that.)

PointedEars

Gregor Kofler

Dec 27, 2008, 10:20:05 AM

Thomas 'PointedEars' Lahn wrote:

> Gregor Kofler wrote:
>> Bruce A. Julseth meinte:
>>> One of the first problems I have come across is what string functions are
>>> available. Right now I'm looking for a "trim" funciton. Is the one? If not,
>>> how can I do it?
>> No. Besides it would be a method of the string (prototype) object.
>>
>> function trim(yourString) {
>> return yourString.replace(/^\s+\s+$/, "");
>> }
>
> Apparently you forgot a | between the two \s+, and the global flag. See the
> FAQ entry.

Oops. Yes, I should re-read my posts...

Gregor

SAM

Dec 27, 2008, 11:42:59 AM

On 12/27/08 4:20 PM, Gregor Kofler wrote:

> Thomas 'PointedEars' Lahn meinte:
>> Gregor Kofler wrote:
>>> Bruce A. Julseth meinte:

>>>> Right now I'm looking for a "trim" funciton.
>>>

>>> function trim(yourString) {
>>> return yourString.replace(/^\s+\s+$/, "");
>>> }
>>
>> Apparently you forgot a | between the two \s+, and the global flag.
>> See the
>> FAQ entry.
>
> Oops. Yes, I should re-read my posts...

And other people's too?
(I had already posted it earlier, and your corrected version is still not complete.)

<http://jibbering.com/faq/#trimString>

--
sm

kangax

Dec 27, 2008, 4:30:44 PM

RobG wrote:

[...]

> If not,
>> how can I do it?
>
> There is one in the FAQ, which also has links to useful resources.
>
>

Perhaps the FAQ should mention that the RegExp whitespace character class
does not conform to the specification (see 15.10.2.12) in some browsers
and that `trim` (as it currently stands in the FAQ) will fail to remove some
of those characters.

A simple example would be:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title></title>
</head>
<body>
<script type="text/javascript">
(function(){
function trim(s) {
return s.replace(/^\s+|\s+$/g, '');
}
var s = ' \n\r\t\x0B\f\xA0 hello \xA0\n \r\t\f\x0B';
// should be "5"
document.write(trim(s).length);
})();
</script>
</body>
</html>

This fails in IE, Safari (although, fixed in webkit nightlies) and Chrome.

A workaround is, of course, simple:

var strip = (function(){
var wspClass = '[\\x09\\x0B\\x0C\\x20\\xA0\\x0A\\x0D\\u2028\\u2029]';
var leadingSpace = new RegExp('^' + wspClass + '+');
var trailingSpace = new RegExp(wspClass + '+$');
return function(s) {
return s.replace(leadingSpace, '').replace(trailingSpace, '');
}
})();
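
Whether a given engine needs the workaround at all can be checked at run time. The following is only a sketch: it assumes a one-off test of \s against the no-break space is good enough for the case at hand, and the names `sMatchesNbsp`, `wsp` and `safeTrim` are invented here.

```javascript
// Feature test: does \s match the no-break space in this engine?
var sMatchesNbsp = /\s/.test('\xA0');

// Fall back to an explicit class (the one from the workaround above)
// only where the built-in \s falls short.
var wsp = sMatchesNbsp
  ? '\\s'
  : '[\\x09\\x0B\\x0C\\x20\\xA0\\x0A\\x0D\\u2028\\u2029]';

var leading = new RegExp('^' + wsp + '+');
var trailing = new RegExp(wsp + '+$');

function safeTrim(s) {
  return s.replace(leading, '').replace(trailing, '');
}

var s = ' \n\r\t\x0B\f\xA0 hello \xA0\n \r\t\f\x0B';
var len = safeTrim(s).length;   // 5 when everything was stripped
```

The test runs once when the script loads, so each later call costs no more than the plain version.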

--
kangax

SAM

Dec 27, 2008, 5:52:20 PM

On 12/27/08 10:30 PM, kangax wrote:

> RobG wrote:
>
> [...]
>
>> If not,
>>> how can I do it?
>>
>> There is one in the FAQ, which also has links to useful resources.
>>
>
> Perhaps, FAQ should mention that RegExp whitespace character class does
> not conform to specification (see: 15.10.2.12) in some of the browsers
> and that `trim` (as it is right now in the FAQ) will fail to remove some
> of those characters.

Is &nbsp; or &#160; or \xA0 a 'blank' (or white-space) character?

My Firefox.3 and Opera.9 seem to think it is.
My iCab.4 and Safari.3 think it is not.

> var s = ' \n\r\t\x0B\f\xA0 hello \xA0\n \r\t\f\x0B';
> // should be "5"
> document.write(trim(s).length);

> fails in IE, Safari (although, fixed in webkit nightlies) and Chrome.

In IE ... I don't know.

> A workaround is, of course, simple:
>
> var strip = (function(){
> var wspClass = '[\\x09\\x0B\\x0C\\x20\\xA0\\x0A\\x0D\\u2028\\u2029]';
> var leadingSpace = new RegExp('^' + wspClass + '+');
> var trailingSpace = new RegExp(wspClass + '+$');
> return function(s) {
> return s.replace(leadingSpace, '').replace(trailingSpace, '');
> }
> })();

It remains to be decided whether we really want to delete this non-breaking character.

--
sm

L.@canberra Trevor Lawrence

Dec 27, 2008, 6:55:43 PM

"Trevor Lawrence" <Trevor L.@Canberra> wrote in message
news:newscache$heqick$vo7$1...@news.grapevine.com.au...

While I appreciate the elegance of the RegExp solutions, there seemed to be
some problems with defining what is white space, so I wondered if the simple
functions below would do the job just as well

function ltrim(str) {
for (var i = 0; i < str.length; i++) {
if(str.substr(i,1)!=" ")
break;
}
return str.substr(i);
}

function rtrim(str) {
for (var i = str.length-1; i >= 0; i--) {
if(str.substr(i,1)!=" ")
break;
}
return str.substr(0,i+1);
}

function trim(str) {
return ltrim(rtrim(str));
}
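
For comparison, a sketch of what these loop-based functions do with input that contains more than plain spaces. The test strings and result variables are made up.

```javascript
// Loop-based helpers as posted above.
function ltrim(str) {
  for (var i = 0; i < str.length; i++) {
    if (str.substr(i, 1) != " ")
      break;
  }
  return str.substr(i);
}

function rtrim(str) {
  for (var i = str.length - 1; i >= 0; i--) {
    if (str.substr(i, 1) != " ")
      break;
  }
  return str.substr(0, i + 1);
}

function trim(str) {
  return ltrim(rtrim(str));
}

var spacesOnly = trim("   Hello world!   ");   // "Hello world!"
// The tab and the newline stop the loops, so nothing is removed here.
var withTab = trim("\t Hello world! \n");
```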

kangax

Dec 27, 2008, 7:22:30 PM

SAM wrote:

[...]

>> Perhaps, FAQ should mention that RegExp whitespace character class
>> does not conform to specification (see: 15.10.2.12) in some of the
>> browsers and that `trim` (as it is right now in the FAQ) will fail to
>> remove some of those characters.
>
> Is &nbsp; or &#160; or \xA0 a 'blank' (or white-space) character ?

I was only talking about the WhiteSpace (7.2) and LineTerminator (7.3)
productions. The former includes '\xA0' and so, yes, it should be
matched by /\s/. I filed a bug with Chrome [1] but only afterwards noticed
that nightly WebKit does not exhibit this (so it's only a matter of
updating Chrome to the newer version).

[...]

[1] http://code.google.com/p/chromium/issues/detail?id=5206

Thomas 'PointedEars' Lahn

Dec 28, 2008, 4:20:33 AM

"Trevor Lawrence" wrote:
> While I appreciate the elegance of the RegExp solutions, there seemed
> to be
> some problems with defining what is white space, so I wondered if the
> simple
> functions below would do the job just as well

> [...]

Dear green beginners,

simple functions like these are how this thing started about a decade
ago. Since they turned out to be inadequate (they handle too few cases, are
inefficient from the outset, and do not scale well), RegExp matching and
replacing was introduced.

That some ECMAScript implementations might not match all specified
white-space characters and line terminators with \s really is no reason
at all for us to prefer Wheel 0.1 that could only jump 1 cm (handling only
one white-space character, the space, inefficiently, and so on), because
RegExp allows us to define our own character classes.

Thank you in advance.

PointedEars

Dr J R Stockton

Dec 28, 2008, 7:51:40 AM

In comp.lang.javascript message <4956b1a5$0$9375$ba4a...@news.orange.fr
>, Sat, 27 Dec 2008 23:52:20, SAM <stephanemor...@wanadoo.fr.in
valid> posted:

>Rest to know if really we want to delete this unbreakable character ?

The FAQ entry should list the characters that the simple code ought to
delete, and the known exceptions in reasonably common browsers. There,
"ought" could be according to 16262 or according to the majority of
usage.

The FAQ at Jibbering has not been changed for a month and is dated a
fortnight earlier. The auto-posted FAQ in this group is dated a month
earlier than that.

--
(c) John Stockton, nr London UK. ?@merlyn.demon.co.uk IE7 FF2 Op9 Sf3
news:comp.lang.javascript FAQ <URL:http://www.jibbering.com/faq/index.html>.
<URL:http://www.merlyn.demon.co.uk/js-index.htm> jscr maths, dates, sources.
<URL:http://www.merlyn.demon.co.uk/> TP/BP/Delphi/jscr/&c, FAQ items, links.

SAM

Dec 28, 2008, 1:56:01 PM

On 12/28/08 12:55 AM, Trevor Lawrence wrote:

>
> While I appreciate the elegance of the RegExp solutions, there seemed to be
> some problems with defining what is white space, so I wondered if the simple
> functions below would do the job just as well

They probably only find and remove the plain space character ' ',
and that doesn't solve the problem of identifying the other blank characters
a "trim" function is expected to handle (what exactly should be cut off?).

function trimWhiteSpace(strg) {
return strg.replace(/^ +| +$/g, '');
}

document.write('['+trimWhiteSpace(" Hello world! ")+']');

--> [Hello world!]

function lTrimWS(strg) { return strg.replace(/^ +/, ''); }

function rTrimWS(strg) { return strg.replace(/ +$/, ''); }

If we refer to the PHP trim function (1), we would have to strip the
following invisible characters from the beginning and end:
* " " (ASCII 32 (0x20)), an ordinary space.
* "\t" (ASCII 9 (0x09)), a tab.
* "\n" (ASCII 10 (0x0A)), a new line (line feed).
* "\r" (ASCII 13 (0x0D)), a carriage return.
* "\0" (ASCII 0 (0x00)), the NUL-byte.
* "\x0B" (ASCII 11 (0x0B)), a vertical tab.
Note that the non-breaking space is absent from that list.

That is roughly what \s is supposed to match in a regular expression (2),
where it is documented as equivalent to:

Gecko :
[\t\n\v\f\r \u00a0\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u3000]
(where \u00a0 is the non-breaking space)

Microsoft :
[\f\n\r\t\v] (and they forget ' ' !)

So we could have :

function trimer( strg, extendedBlanks ) {
var reg = '[\f\n\r\t\v ' + ( extendedBlanks? '\xa0' : '' ) + ']';
reg = new RegExp( '^' + reg + '+|' + reg + '+$', 'g' );
return strg.replace(reg,'');
}

And with the string :
var strng = ' \n\t \n \xa0 hello \xa0 \n \n';
We can get :
document.write('['+ trimer(strng) +']'); // [ hello ]
document.write('['+ trimer(strng,1) +']'); // [hello]

Tested : FF.3, Safari.3, Opera.9, iCab.4
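
If the goal is to mimic PHP's trim() more closely, including its optional second argument, something along these lines might do. This is only a sketch: `phpTrim` and `escapeForClass` are hypothetical names, and PHP's "a..z" range syntax in the character list is not handled.

```javascript
// Escape characters that are special inside a RegExp character class.
function escapeForClass(chars) {
  return chars.replace(/[\\\]\[\^\-]/g, '\\$&');
}

// charlist defaults to the six characters PHP's trim() strips.
function phpTrim(str, charlist) {
  var chars = (charlist === undefined) ? ' \t\n\r\0\x0B' : charlist;
  if (chars === '') {
    return str;                       // nothing to strip
  }
  var cls = '[' + escapeForClass(chars) + ']';
  var reg = new RegExp('^' + cls + '+|' + cls + '+$', 'g');
  return str.replace(reg, '');
}

var a = phpTrim(' \t\n hello \r\n');        // 'hello'
// The no-break space is not in the default list, so this stays unchanged.
var b = phpTrim('\xa0hello\xa0');
var c = phpTrim('--[hello]--', '-[]');      // 'hello'
```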

(1)
<http://fr.php.net/manual/en/function.trim.php>
(2)
<http://msdn.microsoft.com/en-us/library/se61087k(VS.85).aspx>
<https://developer.mozilla.org/En/Core_JavaScript_1.5_Reference:Objects:RegExp>

--
sm

Dr J R Stockton

Dec 28, 2008, 1:45:46 PM

In comp.lang.javascript message <newscache$us7kck$gbg$1...@news.grapevine.c
om.au>, Sat, 27 Dec 2008 23:55:43, Trevor Lawrence <Trevor@L.?.invalid>
posted:

>
>While I appreciate the elegance of the RegExp solutions, there seemed to be
>some problems with defining what is white space, so I wondered if the simple
>functions below would do the job just as well
>
>function ltrim(str) {
>for (var i = 0; i < str.length; i++) {
> if(str.substr(i,1)!=" ")
> break;
>}
>return str.substr(i);
>}

No point in coding like that. One can just modify the RegExp routines
in the FAQ by replacing "\s" with " " and changing "whitespace" to
"space(s)" in the description.

AFAIK, all common browsers include in their "whitespace character = \s"
all the characters which normally need to be handled. The differing
characters can only appear if entered by a programmer, who can make
appropriate tests, or by an end user - and an end user who contrives to
enter an esoteric separator deserves whatever he gets.

The question about \xA0, however, is a good one that should be
considered for the FAQ entry.

--
(c) John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v6.05 MIME.
Web <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
Proper <= 4-line sig. separator as above, a line exactly "-- " (SonOfRFC1036)
Do not Mail News to me. Before a reply, quote with ">" or "> " (SonOfRFC1036)

Garrett Smith

Feb 12, 2009, 3:57:53 AM

kangax wrote:
> SAM wrote:
>
> [...]
>
>>> Perhaps, FAQ should mention that RegExp whitespace character class
>>> does not conform to specification (see: 15.10.2.12) in some of the
>>> browsers and that `trim` (as it is right now in the FAQ) will fail to
>>> remove some of those characters.
>>

Good point. That should definitely be in the FAQ.

>> Is &nbsp; or &#160; or \xA0 a 'blank' (or white-space) character ?
>

Is this fixed in recent JScript in IE8?

Question: Why not:-

function trimString(s){
return s.replace(/^[\s|\xA0]+|[\s|\xA0]+$/g, '');
}

?

> I was only talking about WhiteSpace (7.2) and LineTerminator (7.3)
> productions. Former one includes '\xA0' and so, yes, it should be
> matched by /\s/. I filed a bug with Chrome [1] but only after noticed
> that nightly webkit does not exhibit this (so it's only a matter of
> updating Chrome to the newer version)
>
> [...]
>
> [1] http://code.google.com/p/chromium/issues/detail?id=5206

http://code.google.com/p/chromium/issues/detail?id=5206#c6
| This is fixed in V8 bleeding edge, so it's on its way to Chromium.

Good post. I missed this one.

Garrett

--
comp.lang.javascript FAQ <URL: http://jibbering.com/faq/ >

Matthias Watermann

Feb 12, 2009, 8:26:14 AM

On Thu, 12 Feb 2009 00:57:53 -0800, Garrett Smith wrote:

> Question: Why not:-
>
> function trimString(s){
> return s.replace(/^[\s|\xA0]+|[\s|\xA0]+$/g, '');
> }

Why would you consider the pipe char ("|") as whitespace?
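
Inside a character class the "|" is an ordinary character, not alternation. A small sketch of the difference follows; the function names and the input string are invented for the demonstration.

```javascript
// With "|" inside the brackets: the pipe becomes one of the stripped characters.
function trimWithPipe(s) {
  return s.replace(/^[\s|\xA0]+|[\s|\xA0]+$/g, '');
}

// Without it: only \s and the no-break space are stripped.
function trimNoPipe(s) {
  return s.replace(/^[\s\xA0]+|[\s\xA0]+$/g, '');
}

var cell = ' | a | b | ';              // hypothetical input

var r1 = trimWithPipe(cell);           // 'a | b'
var r2 = trimNoPipe(cell);             // '| a | b |'
```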

--
Matthias
/"\
\ / ASCII RIBBON CAMPAIGN - AGAINST HTML MAIL
X - AGAINST M$ ATTACHMENTS
/ \

kangax

Feb 12, 2009, 9:26:50 AM

Garrett Smith wrote:
[...]

> Is this fixed in recent JScript in IE8?

I'll check this later, will let you know.

>
> Question: Why not:-
>
> function trimString(s){
> return s.replace(/^[\s|\xA0]+|[\s|\xA0]+$/g, '');
> }

That would work, sure. Incidentally, Google Groups, which was recently
criticized here, actually uses exactly such a `trim` [1]

...
S.string.trim=function(a){
return a.replace(/^[\s\xa0]+|[\s\xa0]+$/g,"")
};
...

[1]
http://groups.google.com/groups/static/release/g2_common-2cddf002493e87d5abe24a2765ad49a6.js

--
kangax

Peter May

Feb 12, 2009, 9:52:30 AM

Garrett Smith wrote:
[...]

> function trimString(s){
> return s.replace(/^[\s|\xA0]+|[\s|\xA0]+$/g, '');
> }

Have a look at this interesting article about trim:
http://blog.stevenlevithan.com/archives/faster-trim-javascript

--
Peter

Lasse Reichstein Nielsen

Feb 12, 2009, 5:07:11 PM

kangax <kan...@gmail.com> writes:

> I was only talking about WhiteSpace (7.2) and LineTerminator (7.3)
> productions. Former one includes '\xA0' and so, yes, it should be
> matched by /\s/. I filed a bug with Chrome [1] but only after noticed
> that nightly webkit does not exhibit this (so it's only a matter of
> updating Chrome to the newer version)

RegExp actually doesn't come with WebKit, which Chrome and Safari
share, but with the underlying Javascript implementation.
Chrome uses V8 and Safari uses Squirrelfish (Extreme in nightlies).

The released versions both use the same adaptation of the PCRE library
for regexps, but it's being phased out. SFX is implementing WREC
and V8 has just released a new regexp engine as well. So far it's only
available in the Chrome developer channel releases.

So, you are right, the newest "release" of Chrome (but not the
newest stable release!) does implement \s correctly, but it's
completely unrelated to the version of WebKit being used.

/L
--
Lasse Reichstein Holst Nielsen
'Javascript frameworks is a disruptive technology'

Dr J R Stockton

Feb 12, 2009, 10:43:28 AM

In comp.lang.javascript message <gn0o99$dtm$1...@news.motzarella.org>, Thu,
12 Feb 2009 00:57:53, Garrett Smith <dhtmlk...@gmail.com> posted:

>
>Good point. That should definitely be in the FAQ.
>
>>> Is &nbsp; or &#160; or \xA0 a 'blank' (or white-space) character ?
>
>Is this fixed in recent JScript in IE8?

The FAQ should not presume that \xA0 should be considered, by the
programmer, as RegExp whitespace. That is application-dependent. For
example, a paragraph-packer should treat it largely as if it were a
letter (if a long word MUST be broken, it might be best to do it before
a \xA0, for visibility). In that case, it may be necessary to use not
\s but a more detailed expression.

It seems likely that [ \f\n\r\t\v] will be equivalent to or a subset
of \s , at least in anything which implements \f \n \r \t \v .

Browser RegExps ought to be in accordance with ISO/IEC 16262 Sec 7.2 for
*Source* *Text*, since Sec 15.10 refers to that for "whitespace".
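
The subset assumption is easy to check in any given engine. A sketch follows; the array and variable names are illustrative only, and in an engine that does not implement \v the last entry would simply be the letter "v".

```javascript
// The characters of the class [ \f\n\r\t\v], by escape sequence.
var basic = [' ', '\f', '\n', '\r', '\t', '\v'];

// Collect any of them that \s fails to match in this engine.
var missed = [];
for (var i = 0; i < basic.length; i++) {
  if (!/^\s$/.test(basic[i])) {
    missed.push(basic[i].charCodeAt(0));
  }
}

// An empty list means the class is a subset of \s here.
var isSubset = (missed.length === 0);
```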

Richard Cornford

Feb 12, 2009, 6:59:49 PM

kangax wrote:
> Garrett Smith wrote:
> [...]
>> Is this fixed in recent JScript in IE8?
>
> I'll check this later, will let you know.

If that question was whether \s in a regular expression matches '\u00A0'
in IE 8 then it appears that the answer is no.

>> Question: Why not:-
>>
>> function trimString(s){
>> return s.replace(/^[\s|\xA0]+|[\s|\xA0]+$/g, '');
>> }
>
> That would work, sure.

<snip>

By "work" I assume you mean that it corrects the fact that some
javascript engines do not match '\u00A0' (the non-breaking space
character) when \s is used in a regular expression. But is this an
attempt to create methods that conform to the ECMA (3rd Ed.)
specification, and thus are consistent across platforms, or is it a
desire to arbitrarily include the non-breaking space in the set of
matched characters for a trim function?

The former seems the more reasonable goal, but in that case this trim
function still falls short as where '\u00A0' is not matched the odds are
extremely good that '\u2003' (EM space), to name but one, is not matched
either. So the above function will still produce inconsistent results
between javascript engines.

The ECMA spec wants \s to match javascript's white space characters and
its line terminator characters, but the definition of white space includes
all of Unicode's Zs group, and JScript, for example, does not match the
majority of those.

Try this out:-

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title></title>
</head>
<body>
<pre>
<script type="text/javascript">

var WhiteSpace = [
{
cp:"9", codePoint:"0x0009", character :"\u0009",
name:"<control>[ASCII Tab]", group:"Cc"
},
{
cp:"11", codePoint:"0x000B", character :"\u000B",
name:"<control>[ASCII Vertical Tab]", group:"Cc"
},
{
cp:"12", codePoint:"0x000C", character :"\u000C",
name:"<control>[ASCII Form Feed]", group:"Cc"
},
{
cp:"32", codePoint:"0x0020", character :"\u0020",
name:"SPACE", group:"Zs"
},
{
cp:"160", codePoint:"0x00A0", character :"\u00A0",
name:"NO-BREAK SPACE", group:"Zs"
},
{
cp:"5760", codePoint:"0x1680", character :"\u1680",
name:"OGHAM SPACE MARK", group:"Zs"
},
{
cp:"6158", codePoint:"0x180E", character :"\u180E",
name:"MONGOLIAN VOWEL SEPARATOR", group:"Zs"
},
{
cp:"8192", codePoint:"0x2000", character :"\u2000",
name:"EN QUAD", group:"Zs"
},
{
cp:"8193", codePoint:"0x2001", character :"\u2001",
name:"EM QUAD", group:"Zs"
},
{
cp:"8194", codePoint:"0x2002", character :"\u2002",
name:"EN SPACE", group:"Zs"
},
{
cp:"8195", codePoint:"0x2003", character :"\u2003",
name:"EM SPACE", group:"Zs"
},
{
cp:"8196", codePoint:"0x2004", character :"\u2004",
name:"THREE-PER-EM SPACE", group:"Zs"
},
{
cp:"8197", codePoint:"0x2005", character :"\u2005",
name:"FOUR-PER-EM SPACE", group:"Zs"
},
{
cp:"8198", codePoint:"0x2006", character :"\u2006",
name:"SIX-PER-EM SPACE", group:"Zs"
},
{
cp:"8199", codePoint:"0x2007", character :"\u2007",
name:"FIGURE SPACE", group:"Zs"
},
{
cp:"8200", codePoint:"0x2008", character :"\u2008",
name:"PUNCTUATION SPACE", group:"Zs"
},
{
cp:"8201", codePoint:"0x2009", character :"\u2009",
name:"THIN SPACE", group:"Zs"
},
{
cp:"8202", codePoint:"0x200A", character :"\u200A",
name:"HAIR SPACE", group:"Zs"
},
{
cp:"8239", codePoint:"0x202F", character :"\u202F",
name:"NARROW NO-BREAK SPACE", group:"Zs"
},
{
cp:"8287", codePoint:"0x205F", character :"\u205F",
name:"MEDIUM MATHEMATICAL SPACE", group:"Zs"
},
{
cp:"12288",codePoint:"0x3000", character :"\u3000",
name:"IDEOGRAPHIC SPACE", group:"Zs"
}
];

var LineTerminator = [
{
cp:"8232", codePoint:"0x2028", character :"\u2028",
name:"LINE SEPARATOR", group:"Zl"
},
{
cp:"8233", codePoint:"0x2029", character :"\u2029",
name:"PARAGRAPH SEPARATOR", group:"Zp"
},
{
cp:"10", codePoint:"0x000A", character :"\u000A",
name:"<control>[ASCII Line Feed]", group:"Cc"
},
{
cp:"13", codePoint:"0x000D", character :"\u000D",
name:"<control>[ASCII Carriage Return]", group:"Cc"
}
];

var NonWhiteSpace = [
{
cp:"8203", codePoint:"0x200B", character :"\u200B",
name:"ZERO WIDTH SPACE (will be whitespace in ES 3.1)",
group:"Cf"
},
{
cp:"8204", codePoint:"0x200C", character :"\u200C",
name:"ZERO WIDTH NON-JOINER", group:"Cf"
},
{
cp:"8205", codePoint:"0x200D", character :"\u200D",
name:"ZERO WIDTH JOINER", group:"Cf"
},
{
cp:"65279", codePoint:"0xFEFF", character :"\uFEFF",
name:"ZERO WIDTH NO-BREAK SPACE (will be whitespace in ES 3.1)",
group:"Cf"
}
];

var testAr = [WhiteSpace, LineTerminator];

var testRX = /\s/g;
function testWS(chObj){
var char, stripped, matches;
if(!(char = String.fromCharCode(+chObj.cp))){
return 'Anomaly: Non-empty string is false.\n'
}

/* First verify that the String.fromCharCode result matches the
Unicode escape sequence based string literal for the character.
*/
if(char != chObj.character){
return 'Anomaly in String.fromCharCode('+chObj.cp+')\n'
}
if((stripped = char.replace(testRX, '')) == chObj.character){
matches = false;
}else{
matches = true;
}
return (
matches+
'\t"'+stripped+'"\tcp = '+chObj.codePoint+'\tname = '+
chObj.name+' group = '+chObj.group+'\n'
);
}

var c, cp, list, ctr;
document.write('Should be matched by \\s\n');
for(c = 0;c < testAr.length;++c){
list = testAr[c];
for(cp = 0;cp < list.length;++cp){
ctr = list[cp];
document.write(testWS(ctr)
.replace('true', 'GOOD')
.replace('false', 'FAIL'));

}
}

document.write('\n\nMust not be matched by \\s (in ES 3)\n');
for(cp = 0;cp < NonWhiteSpace.length;++cp){
ctr = NonWhiteSpace[cp];
document.write(testWS(ctr)
.replace('true', 'FAIL')
.replace('false', 'GOOD'));

}
</script>
</pre>
</body>
</html>
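
Where \s cannot be relied on, the tables above suggest spelling the class out. A sketch, assuming the ES3 WhiteSpace and LineTerminator sets listed in the test are the ones wanted; `trimES3` and the variable names are hypothetical.

```javascript
// ES3 WhiteSpace (tab, VT, FF, space, NBSP, Unicode Zs) plus LineTerminator,
// written with explicit escapes so the result does not depend on \s.
var es3Space =
  '[\\u0009\\u000B\\u000C\\u0020\\u00A0\\u1680\\u180E' +
  '\\u2000-\\u200A\\u202F\\u205F\\u3000' +
  '\\u2028\\u2029\\u000A\\u000D]';

var es3Leading = new RegExp('^' + es3Space + '+');
var es3Trailing = new RegExp(es3Space + '+$');

function trimES3(s) {
  return s.replace(es3Leading, '').replace(es3Trailing, '');
}

var padded = '\u2003\u00A0\t foo \u202F\u3000\n';
var bare = trimES3(padded);       // 'foo'
// U+200B is not whitespace in ES3, so it is left alone.
var zwsp = trimES3('\u200Bfoo');
```

Because the class is explicit, the result is the same in every engine, at the price of having to maintain the list by hand.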

Richard.

kangax

Feb 12, 2009, 10:28:10 PM

Richard Cornford wrote:
> kangax wrote:
>> Garrett Smith wrote:
>> [...]
>>> Is this fixed in recent JScript in IE8?
>>
>> I'll check this later, will let you know.
>
> If that question was whether \s in a regular expression matches '\u00A0'
> in IE 8 then it appears that the answer is no.

Oh well.

>
>>> Question: Why not:-
>>>
>>> function trimString(s){
>>> return s.replace(/^[\s|\xA0]+|[\s|\xA0]+$/g, '');
>>> }

I missed this first time, but `|` is not needed here (inside the
character class), is it?

>>
>> That would work, sure.
> <snip>
>
> By "work" I assume you mean that it corrects the fact that some
> javascript engines do not match '\u00A0' (the non-breaking space
> character) when \s is used in a regular expression. But is this an
> attempt to create methods that conform to the ECMA (3rd Ed.)
> specification, and thus are consistent across platforms, or is it a
> desire to arbitrarily include the non-breaking space in the set of
> matched characters for a trim function.

As I understand it, there are practical implications to regex whitespace
character class not matching \xA0 (NBSP) characters in context of `trim`
function.

A simple example demonstrating the issue would be:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title></title>
</head>
<body>
<script type="text/javascript">
(function(){

function trim(s) {
return s.replace(/^\s+/, '')
.replace(/\s+$/, '');
}

function trim_xA0(s) {
return s.replace(/^[\s\xA0]+/, '')
.replace(/[\s\xA0]+$/, '');
}

var el = document.createElement('div');
el.innerHTML = '&nbsp;foo&nbsp;';

document.write('trim: ' +
trim(el.firstChild.nodeValue).length + '<br>');

document.write('trim_xA0: ' +
trim_xA0(el.firstChild.nodeValue).length);

})();
</script>
</body>
</html>

In Firefox and Opera, the "simple" trim removes the "nbsp" characters. IE and
Safari need the addition of \xA0. Considering that \xA0 is part of the
WhiteSpace production (as per ES3) and that such major browsers as IE
and Safari have faulty implementations, it would seem to be a good idea
to use the `\xA0`-patched `trim`.

>
> The former seems the more reasonable goal, but in that case this trim
> function still falls short as where '\u00A0' is not matched the odds are
> extremely good that '\u2003' (EM space), to name but one, is not matched
> either. So the above function will still produce inconsistent results
> between javascript engines.
>
> The ECMA spec wants \s to match javascript's whit space characters and
> it line terminator characters, but the definition of whitespace includes
> all of Unicode's Zs group, and JScript, for example, does not match the
> majority of those.

True. I missed this part of the specs (about Unicode whitespace).

>
> Try this out:-
>

[snip test]

Interesting. There are quite a few failures in all the browsers I could test.

--
kangax

kangax

Feb 12, 2009, 10:34:04 PM

Lasse Reichstein Nielsen wrote:
[...]

> RegExp actually doesn't come with WebKit, which Chrome and Safari
> are sharing, but with the underlying Javascript implementation.
> Chrome uses V8 and Safari uses Squirrelfish (Extreme in nighlies).

Ah, so WebKit is a rendering engine and has nothing to do with ES
implementation. Thanks, I wasn't aware of this : )

[...]

--
kangax

kangax

Feb 12, 2009, 11:01:20 PM

kangax wrote:
> Richard Cornford wrote:
[...]

> [snip test]
>
> Interesting. There's quite a bit of failures in all browsers I could test.
>

Actually, I just downloaded a nightly WebKit build (rev. 40931) and it
passes all the tests (!)

--
kangax

Dr J R Stockton

Feb 13, 2009, 12:44:37 PM

In comp.lang.javascript message <iYOdnUKRy682rAnUnZ2dnUVZ_vjinZ2d@gigane
ws.com>, Thu, 12 Feb 2009 09:26:50, kangax <kan...@gmail.com> posted:

>Garrett Smith wrote:
>[...]
>> Is this fixed in recent JScript in IE8?
>
>I'll check this later, will let you know.
>
>> Question: Why not:-
>> function trimString(s){
>> return s.replace(/^[\s|\xA0]+|[\s|\xA0]+$/g, '');
>> }
>
>That would work, sure. Incidentally, Google Groups, which was recently
>criticized here, actually uses exactly such a `trim` [1]
>
>...
>S.string.trim=function(a){
> return a.replace(/^[\s\xa0]+|[\s\xa0]+$/g,"")
>};

That is not "exactly such". The first also trims leading and trailing
vertical bars; the second should not.
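
A quick way to see the difference is to run both patterns over a string that begins and ends with a vertical bar. This is only an illustrative sketch; the names `trimWithBar` and `trimWithoutBar` are made up for the comparison and appear in neither post.

```javascript
// Pattern from the first quoted function: `|` inside [...] is a literal bar.
function trimWithBar(s) {
  return s.replace(/^[\s|\xA0]+|[\s|\xA0]+$/g, "");
}

// Pattern from the second quoted function: no bar in the character class.
function trimWithoutBar(s) {
  return s.replace(/^[\s\xa0]+|[\s\xa0]+$/g, "");
}

var sample = "| a|b |";
var withBar = trimWithBar(sample);       // bars and spaces at the edges are stripped
var withoutBar = trimWithoutBar(sample); // nothing at the edges is whitespace, so no change
```

Only the bar between the two alternatives acts as an alternation operator; the ones inside the brackets are ordinary characters to match.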

kangax

Feb 13, 2009, 2:55:46 PM

Dr J R Stockton wrote:

[snip]

> That is not "exactly such". The first also trims leading and trailing
> vertical bars; the second should not.
>

Yep, I noticed it later on. See my follow-up to Richard's post.

--
kangax

Richard Cornford

Feb 13, 2009, 5:19:50 PM

kangax wrote:
> Richard Cornford wrote:
>> kangax wrote:
>>> Garrett Smith wrote:

<snip>

>>>> function trimString(s){
>>>> return s.replace(/^[\s|\xA0]+|[\s|\xA0]+$/g, '');
>>>> }
>
> I missed this first time, but `|` is not needed here
> (inside the character class), is it?

No. But my interest was mostly with the justification for including
\u00A0 but not, say, \u202F.

>>> That would work, sure.
>> <snip>
>>
>> By "work" I assume you mean that it corrects the fact that
>> some javascript engines do not match '\u00A0' (the
>> non-breaking space character) when \s is used in a regular
>> expression. But is this an attempt to create methods that
>> conform to the ECMA (3rd Ed.) specification, and thus are
>> consistent across platforms, or is it a desire to arbitrarily
>> include the non-breaking space in the set of matched
>> characters for a trim function.
>
> As I understand it, there are practical implications to
> regex whitespace character class not matching \xA0 (NBSP)
> characters in context of `trim` function.

But there must also be practical implications in its not matching
\u202F. There may be a diminishing likelihood of any given - trim -
implementation encountering particular whitespace characters as they
become increasingly obscure, but we are already well into the obscure
when handling non-breaking space as users are going to be pretty hard
pressed to enter that character into, say, an <INPUT type="text">
element. (That is, it is likely a deliberate act on the part of web
developers to include \u00A0 in a string, and so it is maybe only the
developers who chose that design who will have to deal with the issue)

<snip>

> [snip test]
>
> Interesting. There's quite a bit of failures in all browsers I
> could test.

Yes, so a general - trim - that is going to be consistent across
browsers is going to have to either do quite a bit of work to compensate
for the inconsistencies in \s or abandon the use of \s in favour of a
predictable explicit character class definition.
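
As a rough sketch of the second option, a trim built on an explicit character class might look like the following. The code points listed are my reading of ES3's WhiteSpace and LineTerminator productions plus the Zs characters mentioned in this thread; the exact Zs membership depends on the Unicode version, so treat the list as an assumption to check. The name `trimExplicit` is hypothetical.

```javascript
// Explicit class, so the result does not depend on how the engine implements \s.
// Assumed members: tab, LF, VT, FF, CR, space, NBSP, and the Zs/line-separator
// code points discussed in the thread (verify against the Unicode version you target).
var WS_CLASS =
  "[\\t\\n\\v\\f\\r \\xA0\\u1680\\u180E\\u2000-\\u200A" +
  "\\u2028\\u2029\\u202F\\u205F\\u3000]";

var TRIM_RE = new RegExp("^" + WS_CLASS + "+|" + WS_CLASS + "+$", "g");

// Removes leading and trailing characters from WS_CLASS only.
function trimExplicit(s) {
  return String(s).replace(TRIM_RE, "");
}
```

Building the pattern from one string keeps the leading and trailing classes identical, which is easy to get wrong when the class is written out twice by hand.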

Richard.

kangax

Feb 13, 2009, 11:21:34 PM

Richard Cornford wrote:
> kangax wrote:
>> Richard Cornford wrote:
>>> kangax wrote:
>>>> Garrett Smith wrote:
> <snip>
>>>>> function trimString(s){
>>>>> return s.replace(/^[\s|\xA0]+|[\s|\xA0]+$/g, '');
>>>>> }
>>
>> I missed this first time, but `|` is not needed here
>> (inside the character class), is it?
>
> No. But my interest was mostly with the justification for including
> \u00A0 but not, say, \u202F.

Theoretically there's no reason to prefer \u00A0 over \u202F.
Practically, though, isn't \u00A0 much more widely used on the web?

Google returns 29,900 results vs. 1,270. Running
`document.documentElement.innerHTML.indexOf('\u00A0')` on CNN.com gives
8338, and -1 for \u202F. \u202F also doesn't have a corresponding *named*
HTML entity (while \u00A0 has the infamous "&nbsp;", spread over millions
of pages and articles all over the web).

Technically this is a non-argument of course, but practically I see every
reason to "guard" trim against such a commonly used character.

A FAQ can of course simply *mention* the \s deficiencies when it comes
to matching against some of the whitespace characters, including such a
"popular" one as &nbsp; : )

[snip]

>> As I understand it, there are practical implications to
>> regex whitespace character class not matching \xA0 (NBSP)
>> characters in context of `trim` function.
>
> But there must also be practical implications in its not matching
> \u202F. There may be a diminishing likelihood of any given - trim -
> implementation encountering particular whitespace characters as they
> become increasingly obscure, but we are already well into the obscure
> when handling non-breaking space as users are going to be pretty hard
> pressed to enter that character into, say, an <INPUT type="text">
> element. (That is, it is likely a deliberate act on the part of web
> developers to include \u00A0 in a string, and so it is maybe only the
> developers who chose that design who will have to deal with the issue)

Good point about type="text" elements, but what about cases of parsing
arbitrary html content (mashups?)

[snip]

--
kangax

Richard Cornford

Feb 14, 2009, 2:04:35 PM

kangax wrote:
> Richard Cornford wrote:
>> kangax wrote:
>>> Richard Cornford wrote:
>>>> kangax wrote:
>>>>> Garrett Smith wrote:
>> <snip>
>>>>>> function trimString(s){
>>>>>> return s.replace(/^[\s|\xA0]+|[\s|\xA0]+$/g, '');
>>>>>> }
>>>
>>> I missed this first time, but `|` is not needed here
>>> (inside the character class), is it?
>>
>> No. But my interest was mostly with the justification for
>> including \u00A0 but not, say, \u202F.
>
> Theoretically there's no reason to prefer \u00A0 over \u202F.
> Practically, though, isn't \u00A0 much more widely used on
> the web?

Undoubtedly.

<snip>

> Technically this is a non-argument of course, but practically
> I see all the reasons to "guard" trim against such commonly used
> character.

I have seen lots of examples where a glaring mistake that probably will
be noticed is replaced by a subtle mistake that will slip in under the
radar, with the latter being justified as 'good enough'. But when 'good
enough' is to be used as a justification I prefer to see knowledge about
the context of application driving that judgement.

> A FAQ can of course simply *mention* the \s deficiencies when
> it comes to matching against some of the whitespace characters,
> including such a "popular" one as &nbsp; : )

Actual FAQ entries are supposed to be short, and if they then cannot
accommodate sufficient explanation (or alternative approaches) they can
reference more detailed explanation elsewhere.

> [snip]
>
>>> As I understand it, there are practical implications to
>>> regex whitespace character class not matching \xA0 (NBSP)
>>> characters in context of `trim` function.
>>
>> But there must also be practical implications in its not
>> matching \u202F. There may be a diminishing likelihood of
>> any given - trim - implementation encountering particular
>> whitespace characters as they become increasingly obscure,
>> but we are already well into the obscure when handling
>> non-breaking space as users are going to be pretty hard pressed to
>> enter that character into, say, an <INPUT
>> type="text"> element. (That is, it is likely a deliberate
>> act on the part of web developers to include \u00A0 in a
>> string, and so it is maybe only the developers who chose
>> that design who will have to deal with the issue)
>
> Good point about type="text" elements, but what about cases
> of parsing arbitrary html content (mashups?)

To my thinking an arbitrary string is just that; arbitrary. It may
contain any characters permissible in strings in the language. We know
what is going to be necessary to achieve consistent behaviour from a -
trim - function if it is to act on arbitrary strings (explicit
character class definitions in place of the use of the buggy \s).

Strings off (or derived from) HTML are something more knowable (they
don't come into existence by accident) and so with a - trim - for that
context it may be practical to apply a different set of judgements when
deciding what would be appropriate. Though without additional knowledge
of how/why it will not be possible, there still is a possibility of such
a string containing a \u202F, just at a reduced probability. We come
back to deciding what it is about _this_ HTML that makes disregarding
\u202F 'good enough'.

Richard.

Garrett Smith

Feb 14, 2009, 5:51:48 PM

Users sometimes copy-paste text into a textarea.

The trimmed value would contain whatever the user scraped up with his
mouse, minus whatever the implementation of \s had removed. The result
would vary, depending on the implementation of \s.

A function to validate the length of a textarea's trimmed value would
have to define a large character class in order to be consistent across
implementations.
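
A minimal sketch of that kind of validation, assuming the application decides for itself what counts as edge whitespace. The class below and the names `trimmedLength` and `isLengthValid` are placeholders of mine, not an agreed list.

```javascript
// Hypothetical edge-whitespace class; extend or shrink it to suit the application.
var EDGE = "[ \\t\\r\\n\\f\\v\\xA0\\u2000-\\u200A\\u2028\\u2029\\u202F\\u3000]";
var EDGE_RE = new RegExp("^" + EDGE + "+|" + EDGE + "+$", "g");

// Length of `value` once edge whitespace (as defined above) is removed.
function trimmedLength(value) {
  return String(value).replace(EDGE_RE, "").length;
}

// True when the trimmed text is between `min` and `max` characters long.
function isLengthValid(value, min, max) {
  var n = trimmedLength(value);
  return n >= min && n <= max;
}
```

Because the class is spelled out, the same pasted text should give the same length in every engine, which is the point of not leaning on \s here.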

> <snip>> [snip test]
>>
>> Interesting. There's quite a bit of failures in all browsers I
>> could test.
>
> Yes, so a general - trim - that is going to be consistent across
> browsers is going to have to either do quite a bit of work to compensate
> for the inconsistencies in \s or abandon the use of \s in favour of a
> predictable explicit character class definition.
>

HTML defines white space characters[1]:
* ASCII space (&#x0020;)
* ASCII tab (&#x0009;)
* ASCII form feed (&#x000C;)
* Zero-width space (&#x200B;)
Line breaks:
* carriage return (&#x000D;)
* line feed (&#x000A;)

A trim function used for the purpose of normalizing a class attribute
should technically include zero-width space and form-feed, but omitting
those would be OK. That leaves:-

function trimString(s) {
return s.replace(/^[\x20\t\r\n]+|[\x20\t\r\n]+$/g,"");
}

However, even though \s is inconsistent across implementations, known
implementations include \x20\t\r\n for \s. That leaves:

function trimString(s) {
return s.replace(/^\s+|\s+$/g,"");
}

Garrett

[1]http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1

Dr J R Stockton

Feb 15, 2009, 8:41:53 AM

In comp.lang.javascript message <gn7hu7$2f2$1...@news.motzarella.org>, Sat,
14 Feb 2009 14:51:48, Garrett Smith <dhtmlk...@gmail.com> posted:

>A function to validate the length of a textarea's trimmed value would
>have to define a large character class in order to be consistent across
>implementations.

Not necessarily large. In a given context, a small class will commonly
suffice. Generally, the characters space, tab and newline only, with
anything else taken to mean that whoever put that character in
is clearly a buffoon or a troublemaker.

This discussion is pointless.

The trim routine in the FAQ should use \s. Anything more complex than
\s will obscure how the routine works to the class of reader for which
the FAQ entry is needed. Remember, one use of the entry is as a
demonstration of a simple RegExp.

The FAQ entry should also say that, although the standard more-or-less
defines \s, browsers etc. do not always implement that fully, but can be
expected to recognise at least [ \t\r\n\v\f] (assuming that most do).

The FAQ entry should also say that the appropriate definition of
whitespace depends on the circumstances of use; \s as found is usually
but not always suitable. It might specifically warn that \xA0 may or
may not be included (if so), and may or may not be wanted.
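
To check that baseline on a given browser, something like the following would do. It is a sketch only; `missingBaseline` and `matchesNbsp` are names I have made up for it.

```javascript
// The six characters the FAQ wording would promise that \s matches.
var BASELINE = [" ", "\t", "\r", "\n", "\v", "\f"];

// Returns the hex codes of any baseline characters that \s fails to match here.
function missingBaseline() {
  var missing = [];
  for (var i = 0; i < BASELINE.length; i++) {
    if (!/\s/.test(BASELINE[i])) {
      missing.push(BASELINE[i].charCodeAt(0).toString(16));
    }
  }
  return missing;
}

// Whether \s also covers the no-break space differs between engines,
// so this is reported rather than assumed.
var matchesNbsp = /\s/.test("\xA0");
```

An empty result from `missingBaseline()` supports the "at least" claim for that engine; `matchesNbsp` answers the \xA0 question separately.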

For example, there is a defined Paragraph Separator character,
PS. In JavaScript source, it should clearly be equivalent to a
single space. But in code intended to pack text paragraphs, it
should clearly not be treated that way.

--
(c) John Stockton, nr London UK. ?@merlyn.demon.co.uk BP7, Delphi 3 & 2006.
<URL:http://www.merlyn.demon.co.uk/> TP/BP/Delphi/&c., FAQqy topics & links;
<URL:http://www.bancoems.com/CompLangPascalDelphiMisc-MiniFAQ.htm> clpdmFAQ;
NOT <URL:http://support.codegear.com/newsgroups/>: news:borland.* Guidelines

Garrett Smith

Feb 16, 2009, 2:27:53 AM

Dr J R Stockton wrote:

> In comp.lang.javascript message <gn7hu7$2f2$1...@news.motzarella.org>, Sat,
> 14 Feb 2009 14:51:48, Garrett Smith <dhtmlk...@gmail.com> posted:
>
>> A function to validate the length of a textarea's trimmed value would
>> have to define a large character class in order to be consistent across
>> implementations.
>
> Not necessarily large. In a given context, a small class will commonly
> suffice. Generally, the characters space, tab and newline only, with
> anything else taken to mean that whoever put that character in
> is clearly a buffoon or a troublemaker.
>

A small class would usually suffice. I was thinking of a user
copy-pasting text from some other program. That clipboard text could
potentially contain other whitespace characters. In that case, if the
program tried to validate the "length" of the textarea, and wanted to
trim whitespace, the result could vary between browsers. It would be a
perfectly innocent mistake on the part of the user (and an edge case).

> This discussion is pointless.
>

I learned something.

> The trim routine in the FAQ should use \s. Anything more complex than
> \s will obscure how the routine works to the class of reader for which
> the FAQ entry is needed. Remember, one use of the entry is as a
> demonstration of a simple RegExp.
>

The trim routine does use \s, but it would be a good idea to make
mention of the inconsistencies. It would be useful to have a test page
for \s.

> The FAQ entry should also say that, although the standard more-or-less
> defines \s, browsers etc. do not always implement that fully, but can be
> expected to recognise at least [ \t\r\n\v\f] (assuming that most do).
>
> The FAQ entry should also say that the appropriate definition of
> whitespace depends on the circumstances of use; \s as found is usually
> but not always suitable. It might specifically warn that \xA0 may or
> may not be included (if so), and may or may not be wanted.
>
> For example, there is a defined Paragraph Separator character,
> PS. In JavaScript source, it should clearly be equivalent to a
> single space. But in code intended to pack text paragraphs, it
> should clearly not be treated that way.
>

Are you sure? I thought that a PS would be interpreted as a line
separator[1]. A line separator affects the program.

Of course in a string, it would not be treated as a space. Or would it?

javascript: var s = "t\u2029e\u2029s\u2029t"; alert(s);

Safari interprets that character as a newline. Result:-
t
e
s
t

[1]http://bclary.com/2004/11/07/#a-7.3

Garrett

Dr J R Stockton

Feb 16, 2009, 4:19:07 PM

In comp.lang.javascript message <gnb4hr$7uf$1...@news.motzarella.org>, Sun,
15 Feb 2009 23:27:53, Garrett Smith <dhtmlk...@gmail.com> posted:

>Dr J R Stockton wrote:
>> In comp.lang.javascript message <gn7hu7$2f2$1...@news.motzarella.org>, Sat,
>> 14 Feb 2009 14:51:48, Garrett Smith <dhtmlk...@gmail.com> posted:
>>
>>> A function to validate the length of a textarea's trimmed value would
>>> have to define a large character class in order to be consistent across
>>> implementations.
>> Not necessarily large. In a given context, a small class will commonly
>> suffice. Generally, the characters space, tab and newline only, with
>> anything else taken to mean that whoever put that character in
>> is clearly a buffoon or a troublemaker.
>>
>
>A small class would usually suffice. I was thinking of a user copy-
>pasting text from some other program. That clipboard text could
>potentially contain other whitespace characters. In that case, if the
>program tried to validate the "length" of the textarea, and wanted to
>trim whitespace, the result could vary between browsers. It would be a
>perfectly innocent mistake on the part of the user (and an edge case).

Not only will the result vary between browsers, but the desired result
will vary between applications. That is why the FAQ code should be
happy with \s, but that should be supplemented by a statement that \s
varies and application needs vary.

>> The trim routine in the FAQ should use \s. Anything more complex than
>> \s will obscure how the routine works to the class of reader for which
>> the FAQ entry is needed. Remember, one use of the entry is as a
>> demonstration of a simple RegExp.
>>
>
>The trim routine does use \s, but it would be a good idea to make
>mention of the inconsistencies. It would be useful to have a test page
>for \s.

That needs a complete list of all possible Unicode spaces.

On a 3GHz PC, XP sp3, FF3, the following takes perceptible but
insignificant time to list all non-matches to \S : it could perhaps be
done better.

S = T = "" ; Max = Math.pow(2, 16)
for (J = 0 ; J<Max ; J++) S += String.fromCharCode(J)
S = S.replace(/\S/g, "")
for (J=0 ; J<S.length ; J++) T += " " + S.charCodeAt(J).toString(16)
Answer = S.length + ";" + T

FF3: 22; 9 a b c d 20 a0 2000 2001 2002 2003 2004 2005 2006 2007 2008
2009 200a 200b 2028 2029 3000

IE7: 6; 9 a b c d 20

Op9: 24; 9 a b c d 20 a0 1680 2000 2001 2002 2003 2004 2005 2006 2007
2008 2009 200a 200b 2028 2029 202f 3000

Sf3: 6; 9 a b c d 20

Ch1: 6; 9 a b c d 20

(Opera is 9.2, not 9.6)

My js-valid.htm now contains that, with improved output.

If you want to recommend running that, put a version on jibbering!

Probably the above should now also use S = S.replace(/\s/g, "") and
check that S.length == 0. And test that no character matches both \s
and \S.
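
Those two extra checks can be folded into one pass. The following is a sketch, with `checkComplement` being a name of my own; it looks only at the 16-bit code units 0x0000 to 0xFFFF, as the listing above does.

```javascript
// Scans every 16-bit code unit and records any that \s and \S both match,
// or that neither matches. Both lists should come back empty.
function checkComplement() {
  var ws = /\s/, nonWs = /\S/;
  var both = [], neither = [];
  for (var code = 0; code < 0x10000; code++) {
    var ch = String.fromCharCode(code);
    var isWs = ws.test(ch);
    var isNonWs = nonWs.test(ch);
    if (isWs && isNonWs) both.push(code.toString(16));
    if (!isWs && !isNonWs) neither.push(code.toString(16));
  }
  return { both: both, neither: neither };
}

var complement = checkComplement();
```

A non-empty `neither` list corresponds to S.length being non-zero after the second replace; a non-empty `both` list would mean the two classes overlap.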

Given the list of implemented Unicode spaces, one now needs their names.

See <http://www.unicode.org/charts/PDF/U0000.pdf>, entry for 0020; and
others via <http://www.unicode.org/charts/>, top. It is nice to know
that Opera includes Ogham support.

>> The FAQ entry should also say that, although the standard more-or-less
>> defines \s, browsers etc. do not always implement that fully, but can be
>> expected to recognise at least [ \t\r\n\v\f] (assuming that most do).

That seems so, from above.

>> The FAQ entry should also say that the appropriate definition of
>> whitespace depends on the circumstances of use; \s as found is usually
>> but not always suitable. It might specifically warn that \xA0 may or
>> may not be included (if so), and may or may not be wanted.
>> For example, there is a defined Paragraph Separator character,
>> PS. In JavaScript source, it should clearly be equivalent to a
>> single space. But in code intended to pack text paragraphs, it
>> should clearly not be treated that way.
>>
>
>Are you sure? I thought that a PS would be interpreted as a line
>separator[1]. A line separator affects the program.

Agreed - that's what I meant. :-(

>Of course in a string, it would not be treated as a space. Or would it?
>
>javascript: var s = "t\u2029e\u2029s\u2029t"; alert(s);
>
>Safari interprets that character as a newline. Result:-

Safari is passing the string to alert, which is passing it to Windows?

Here,
FF3 shows "test"
IE7 shows illegible
Opera shows "t|e|s|t" which may match IE
Safari here shows "test"
Chrome shows illegible, probably matches Opera

One should I think prefer to test with document.write(s) and putting s
into a textarea.
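
Since alert() output depends on how each browser and platform draws the character, it may also help to look at the string's contents directly. A small sketch, with `codeUnits` as a made-up helper name:

```javascript
// Lists the UTF-16 code units of a string in hex, independent of any rendering.
function codeUnits(str) {
  var out = [];
  for (var i = 0; i < str.length; i++) {
    out.push(str.charCodeAt(i).toString(16));
  }
  return out.join(" ");
}

var s = "t\u2029e\u2029s\u2029t";
var listing = codeUnits(s);
```

If the listing is the same everywhere, the differences seen in the alert boxes come from display, not from how the string literal was parsed.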

kangax

Feb 16, 2009, 11:30:43 PM

Dr J R Stockton wrote:

[snip]

> That needs a complete list of all possible Unicode spaces.
>
>
> On a 3GHz PC, XP sp3, FF3, the following takes perceptible but
> insignificant time to list all non-matches to \S : it could perhaps be
> done better.
>
> S = T = "" ; Max = Math.pow(2, 16)
> for (J = 0 ; J<Max ; J++) S += String.fromCharCode(J)
> S = S.replace(/\S/g, "")
> for (J=0 ; J<S.length ; J++) T += " " + S.charCodeAt(J).toString(16)
> Answer = S.length + ";" + T
>
> FF3: 22; 9 a b c d 20 a0 2000 2001 2002 2003 2004 2005 2006 2007 2008
> 2009 200a 200b 2028 2029 3000
>
> IE7: 6; 9 a b c d 20
>
> Op9: 24; 9 a b c d 20 a0 1680 2000 2001 2002 2003 2004 2005 2006 2007
> 2008 2009 200a 200b 2028 2029 202f 3000
>
> Sf3: 6; 9 a b c d 20
>
> Ch1: 6; 9 a b c d 20

Why not just use Richard's test, posted earlier in this thread? It tests
client's \s against all of the whitespace characters (including Unicode
"space separators"). Doesn't it clearly demonstrate above mentioned
oddities?

[snip]

> Given the list of implemented Unicode spaces, one now needs their names.
>
> See <http://www.unicode.org/charts/PDF/U0000.pdf>, entry for 0020; and
> others via <http://www.unicode.org/charts/>, top. It is nice to know
> that Opera includes Ogham support.

The above mentioned test lists all of the Unicode names as well.

On a side note, Firefox 3.1 beta 2 now has `String.prototype.trim` (as
well as, iirc, `trimLeft` and `trimRight`). Firefox's internal \s
representation fails to match some of the characters and also
erroneously matches some of the non-whitespace characters (as one can
see by running the above mentioned test). This leads to
`String.prototype.trim` choking on those very same troublesome characters.

I will file a ticket with Mozilla once I get a chance (so far, only
nightly WebKit gets all of the characters right)

[snip]

--
kangax

Dr J R Stockton

Feb 17, 2009, 2:00:48 PM

In comp.lang.javascript message <XOGdnfc7I6_uoAfUnZ2dnUVZ_g4LAAAA@giganews.com>,
Mon, 16 Feb 2009 23:30:43, kangax <kan...@gmail.com> posted:

>Dr J R Stockton wrote:

>> On a 3GHz PC, XP sp3, FF3, the following takes perceptible but
>> insignificant time to list all non-matches to \S : it could perhaps be
>> done better.

>Why not just use Richard's test, posted earlier in this thread? It
>tests client's \s against all of the whitespace characters (including
>Unicode "space separators"). Doesn't it clearly demonstrate above
>mentioned oddities?

Richard's test considers only the characters that he thinks should be
treated by \s as spaces, etc. Mine, much quicker to write, found all
characters that don't match \S in the current browser (it now uses
S.match(/\s/g)). The tests are logically distinct.

If there is a character, such as

cp:"6158", codePoint:"0x180E", character :"\u180E",
name:"MONGOLIAN VOWEL SEPARATOR", group:"Zs"

that NO browser recognises, that's not much of a worry for coders
(unless handling Mongolian) since testing on any browser will give the
same result.

Richard seems to be taking the attitude that whitespace should be all
that, and nothing more than, the Unicode standard says (because that is
what ISO/IEC 16262 says). That attitude is appropriate for writing and
testing JavaScript engines.

Mostly, though, the question should be "What does this application need
to consider as whitespace, and can I use \s for that?". In most cases,
all that is needed is [ \r\n], often with \t added; but having \v & \f
and lacking all others does no harm.

<http://oreilly.com/catalog/9780596514273/index.html> may be
worth looking into.

>> Given the list of implemented Unicode spaces, one now needs their names.
>> See <http://www.unicode.org/charts/PDF/U0000.pdf>, entry for 0020; and
>> others via <http://www.unicode.org/charts/>, top. It is nice to know
>> that Opera includes Ogham support.
>
>The above mentioned test lists all of the Unicode names as well.

True, though less authoritatively. But the route I gave leads to all
other Unicode names too.

>On a side note, Firefox 3.1 beta 2 now has `String.prototype.trim` (as
>well as, iirc, `trimLeft` and `trimRight`). Firefox's internal \s
>representation fails to match some of the characters and also
>erroneously matches some of the non-whitespace characters (as one can
>see by running the above mentioned test). This leads to
>`String.prototype.trim` choking on those very same troublesome
>characters.

Don't fall into the trap of believing that the set of characters one
needs to trim is equal to the set that is defined as whitespace.
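
One way to keep that decision with the application is to build the trim function from whatever set of characters it cares about. This is a sketch under that assumption; `makeTrim` and the two example trimmers are hypothetical names.

```javascript
// Builds a trim function for exactly the characters given in `chars`.
function makeTrim(chars) {
  // Escape the characters that are special inside a character class.
  var cls = "[" + chars.replace(/[\\\]\^\-]/g, "\\$&") + "]";
  var re = new RegExp("^" + cls + "+|" + cls + "+$", "g");
  return function (s) {
    return String(s).replace(re, "");
  };
}

// Two different applications, two different ideas of what to trim.
var trimBasic = makeTrim(" \t\r\n");
var trimBasicAndNbsp = makeTrim(" \t\r\n\xA0");
```

With `trimBasic` a leading no-break space survives, with `trimBasicAndNbsp` it does not; neither depends on what the engine's \s happens to match.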

kangax

Feb 17, 2009, 7:26:36 PM

Dr J R Stockton wrote:
> In comp.lang.javascript message <XOGdnfc7I6_uoAfUnZ2dnUVZ_g4LAAAA@giganews.com>,
> Mon, 16 Feb 2009 23:30:43, kangax <kan...@gmail.com> posted:
>> Dr J R Stockton wrote:
>
>>> On a 3GHz PC, XP sp3, FF3, the following takes perceptible but
>>> insignificant time to list all non-matches to \S : it could perhaps be
>>> done better.
>
>
>> Why not just use Richard's test, posted earlier in this thread? It
>> tests client's \s against all of the whitespace characters (including
>> Unicode "space separators"). Doesn't it clearly demonstrate above
>> mentioned oddities?
>
> Richard's test considers only the characters that he thinks should be
> treated by \s as spaces, etc. Mine, much quicker to write, found all

That list seems very logical to me. /\s/ (CharacterClassEscape :: s) is
clearly defined in ES3's 15.10.2.12. WhiteSpace (7.2), which /\s/
references, clearly lists all of the character code points. It also
mentions Unicode space separators. Those space separators are also
clearly defined in Unicode [1] under the White_Space section.

> characters that don't match \S in the current browser (it now uses
> S.match(/\s/g)). The tests are logically distinct.
>
> If there is a character, such as
> cp:"6158", codePoint:"0x180E", character :"\u180E",
> name:"MONGOLIAN VOWEL SEPARATOR", group:"Zs"
> that NO browser recognises, that's not much of a worry for coders
> (unless handling Mongolian) since testing on any browser will give the
> same result.

Doesn't it make more sense to base tests on specs, rather than on some
vague subset of browsers? We can't really assert that "NO browser
recognizes" "MONGOLIAN VOWEL SEPARATOR"; neither can we test "all
browsers", can we?

>
> Richard seems to be taking the attitude that whitespace should be all
> that, and nothing more than, the Unicode standard says (because that is
> what ISO/IEC 16262 says). That attitude is appropriate for writing and
> testing JavaScript engines.
>
> Mostly, though, the question should be "What does this application need
> to consider as whitespace, and can I use \s for that?". In most cases,
> all that is needed is [ \r\n], often with \t added; but having \v & \f
> and lacking all others does no harm.

Absolutely. It's all about context.

[snip]

>> On a side note, Firefox 3.1 beta 2 now has `String.prototype.trim` (as
>> well as, iirc, `trimLeft` and `trimRight`). Firefox's internal \s
>> representation fails to match some of the characters and also
>> erroneously matches some of the non-whitespace characters (as one can
>> see by running the above mentioned test). This leads to
>> `String.prototype.trim` choking on those very same troublesome
>> characters.
>
> Don't fall into the trap of believing that the set of characters one
> needs to trim is equal to the set that is defined as whitespace.
>

I'm not : ) As it stands now, Firefox's \s is simply not ES3-compliant
and its deficiencies affect native `trim` (as that `trim` relies on \s)

[1] http://unicode.org/Public/UNIDATA/PropList.txt

--
kangax

Dr J R Stockton

Feb 18, 2009, 2:22:17 PM

In comp.lang.javascript message <k4ydnY53BoUhyAbUnZ2dnUVZ_umWnZ2d@giganews.com>,
Tue, 17 Feb 2009 19:26:36, kangax <kan...@gmail.com> posted:

>Dr J R Stockton wrote:
>> In comp.lang.javascript message <XOGdnfc7I6_uoAfUnZ2dnUVZ_g4LAAAA@giganews.com>,
>> Mon, 16 Feb 2009 23:30:43, kangax <kan...@gmail.com> posted:
>>> Dr J R Stockton wrote:
>>
>>>> On a 3GHz PC, XP sp3, FF3, the following takes perceptible but
>>>> insignificant time to list all non-matches to \S : it could perhaps be
>>>> done better.
>>
>>> Why not just use Richard's test, posted earlier in this thread? It
>>> tests client's \s against all of the whitespace characters (including
>>> Unicode "space separators"). Doesn't it clearly demonstrate above
>>> mentioned oddities?
>> Richard's test considers only the characters that he thinks should be
>> treated by \s as spaces, etc. Mine, much quicker to write, found all
>
>That list seems very logical to me. /\s/ (CharacterClassEscape :: s) is
>clearly defined in ES3's 15.10.2.12. WhiteSpace (7.2), which /\s/
>references, clearly lists all of the character code points. It also
>mentions Unicode space separators. Those space separators are also
>clearly defined in Unicode [1] under the White_Space section.

AFAICS, Richard's test says nothing about whether \s or \S matches
\u3000. Therefore, Richard's test cannot tell whether a browser is
fully compliant. Mine can, except for handling any character coding
outside 0x0000 to 0xFFFF.

>> characters that don't match \S in the current browser (it now uses
>> S.match(/\s/g)). The tests are logically distinct.
>> If there is a character, such as
>> cp:"6158", codePoint:"0x180E", character :"\u180E",
>> name:"MONGOLIAN VOWEL SEPARATOR", group:"Zs"
>> that NO browser recognises, that's not much of a worry for coders
>> (unless handling Mongolian) since testing on any browser will give the
>> same result.
>
>Doesn't it make more sense to base tests on specs, rather than on some
>vague subset of browsers? We can't really assert that "NO browser
>recognizes" "MONGOLIAN VOWEL SEPARATOR"; neither can we test "all
>browsers", can we?

You missed the stress in "if ... NO browser".

Test fully against specs to find out whether the tested systems are
compliant. Test browsers covering most of the market for Windows
browsers to find out what most (Windows) users will have in their
browsers. The tests are quite distinct.

However, after using my test, one only has to read the list of Unicode
whitespace characters to see how it compares with the result of my test.

>I'm not : ) As it stands now, FireFox's \s is simply not ES3-compliant
>and its deficiencies affect native `trim` (as that `trim` relies on \s)

But whether that is important for a particular page depends on whether
any incorrectly-classed characters can appear within it, and (if they
do) whether the difference really matters.

Consider reading an ISO 8601 date-and-time, found in the text of a
document. ISO 8601:2000 required a 'T' in the middle; it does not allow
't', but that should generally be tolerated. ISO 8601:2004 allows a
space instead, without (AFAIR, ICBW) actually specifying \x20 or \xA0.
In practice, the text may get paragraph-packed, so a reader should
accept a newline followed by spaces and HTabs. But perhaps not two
newlines. But maybe a form feed surrounded by newlines should count as
a newline. And one should ignore page headers and footers. But the
chances of finding a Mongolian character (which might look like a space)
are, in non-Mongolian contexts, negligible.
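
A tolerant reader along those lines might be sketched as below. Which separators to accept is my reading of the paragraph above and not something taken from ISO 8601 itself; `findDateTime` is a made-up name, and page headers and footers are not handled.

```javascript
// Date, then one tolerated separator, then time. Accepted separators (assumed):
// 'T' or 't', a single space or no-break space, or ONE newline followed by
// optional spaces and tabs. Two consecutive newlines are not accepted.
var DATE_TIME_RE =
  /(\d{4}-\d{2}-\d{2})(?:[Tt]|[ \xA0]|\r?\n[ \t]*)(\d{2}:\d{2}(?::\d{2})?)/;

// Returns { date, time } for the first date-and-time found in `text`, else null.
function findDateTime(text) {
  var m = DATE_TIME_RE.exec(text);
  return m ? { date: m[1], time: m[2] } : null;
}
```

Nothing here uses \s, so the set of accepted separators is the same in every engine and can be widened or narrowed deliberately.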

kangax

Feb 19, 2009, 10:37:59 AM

Dr J R Stockton wrote:
> In comp.lang.javascript message <k4ydnY53BoUhyAbUnZ2dnUVZ_umWnZ2d@giganews.com>,
> Tue, 17 Feb 2009 19:26:36, kangax <kan...@gmail.com> posted:

[snip]

>> That list seems very logical to me. /\s/ (CharacterClassEscape :: s) is
>> clearly defined in ES3's 15.10.2.12. WhiteSpace (7.2), which /\s/
>> references, clearly lists all of the character code points. It also
>> mentions Unicode space separators. Those space separators are also
>> clearly defined in Unicode [1] under the White_Space section.
>
> AFAICS, Richard's test says nothing about whether \s or \S matches
> \u3000. Therefore, Richard's test cannot tell whether a browser is
> fully compliant. Mine can, except for handling any character coding
> outside 0x0000 to 0xFFFF.

\u3000 is the last item in `WhiteSpace` array in that test:

...

{
cp:"12288",codePoint:"0x3000", character :"\u3000",
name:"IDEOGRAPHIC SPACE", group:"Zs"
}

...

[snip]

>> Doesn't it make more sense to base tests on specs, rather than on some
>> vague subset of browsers? We can't really assert that "NO browser
>> recognizes" "MONGOLIAN VOWEL SEPARATOR"; neither can we test "all
>> browsers", can we?
>
> You missed the stress in "if ... NO browser".
>
>
> Test fully against specs to find out whether the tested systems are
> compliant. Test browsers covering most of the market for Windows
> browsers to find out what most (Windows) users will have in their
> browsers. The tests are quite distinct.

What does Windows have to do with this? I do understand that tests are
distinct and one might "catch" characters that the other one wouldn't.
Richard's one can be described as - "check \s compliance of a given
browser as per ES3", while yours, as - "check which characters fall (or
not) under the successful match of \s for a given browser".

I agree that there's no "right" test in this case. They simply have
different intentions : )
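
For comparison, the spec-driven style can be sketched with a short table in the shape of the entry quoted above. The four rows and the name `checkAgainstTable` are my own illustration; the actual test covers far more characters, and group labels can differ between Unicode versions.

```javascript
// A few rows in the style of the quoted table; not the full list.
var SAMPLE = [
  { codePoint: "0x00A0", character: "\u00A0", name: "NO-BREAK SPACE", group: "Zs" },
  { codePoint: "0x180E", character: "\u180E", name: "MONGOLIAN VOWEL SEPARATOR", group: "Zs" },
  { codePoint: "0x202F", character: "\u202F", name: "NARROW NO-BREAK SPACE", group: "Zs" },
  { codePoint: "0x3000", character: "\u3000", name: "IDEOGRAPHIC SPACE", group: "Zs" }
];

// For each expected whitespace character, records whether this engine's \s matches it.
function checkAgainstTable(table) {
  var report = [];
  for (var i = 0; i < table.length; i++) {
    report.push({
      codePoint: table[i].codePoint,
      name: table[i].name,
      matched: /\s/.test(table[i].character)
    });
  }
  return report;
}

var report = checkAgainstTable(SAMPLE);
```

This answers "does \s match what the table says it should?", whereas the full scan answers "what does \s match here?"; a character missing from the table is invisible to the first and obvious in the second.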

>
> However, after using my test, one only has to read the list of Unicode
> whitespace characters to see how it compares with the result of my test.
>
>
>
>> I'm not : ) As it stands now, FireFox's \s is simply not ES3-compliant
>> and its deficiencies affect native `trim` (as that `trim` relies on \s)
>
> But whether that is important for a particular page depends on whether
> any incorrectly-classed characters can appear within it, and (if they
> do) whether the difference really matters.

Of course. Generally speaking, though, if I were to use native `trim` in
clients that support it (e.g. FF3.1+), I would want to know exactly
which characters it matches (for obvious reasons). Even better, I would
want native `trim` to be fully ES3-compliant, as that would make it a
viable replacement for a "custom" trim.
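
For illustration, a "custom" trim that does not depend on the engine's \s might look like the sketch below. The character list is an assumption (my reading of ES3 7.2 and 7.3 plus the Zs category), and `customTrim` is a hypothetical name, so check the list against the Unicode data you care about before relying on it.

```javascript
// Explicit whitespace list (an assumption, see above): ES3 WhiteSpace,
// LineTerminator, and the Unicode space separators.
var WS = "\\u0009\\u000A\\u000B\\u000C\\u000D\\u0020\\u00A0\\u1680\\u180E" +
         "\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000";

// Matches a run of listed characters at the start or at the end of a string.
var LEADING_OR_TRAILING = new RegExp("^[" + WS + "]+|[" + WS + "]+$", "g");

// Removes leading and trailing characters from the explicit list,
// independently of what the engine's \s happens to match.
function customTrim(str) {
  return String(str).replace(LEADING_OR_TRAILING, "");
}

var trimmed = customTrim("\u3000 Hello world!\u00A0\n");
```

Because the list is spelled out, the result is the same in every engine, which is the point of not leaning on \s here.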

>
> Consider reading an ISO 8601 date-and-time, found in the text of a
> document. ISO 8601:2000 required a 'T' in the middle; it does not allow
> 't', but that should generally be tolerated. ISO 8601:2004 allows a
> space instead, without (AFAIR, ICBW) actually specifying \x20 or \xA0.
> In practice, the text may get paragraph-packed, so a reader should
> accept a newline followed by spaces and HTabs. But perhaps not two
> newlines. But maybe a form feed surrounded by newlines should count as
> a newline. And one should ignore page headers and footers. But the
> chances of finding a Mongolian character (which might look like a space)
> are, in non-Mongolian contexts, negligible.
>
>

Sounds like a case of underspecification.
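
A tolerant reader along the lines you describe could be sketched roughly as follows. The separator rules are an assumption taken from your description, not from ISO 8601 itself, and `DATE_TIME` and `findDateTime` are hypothetical names.

```javascript
// Date, then a tolerant separator, then time. The separator is one of:
//   'T' or 't';
//   a run of spaces, tabs or no-break spaces;
//   a single newline with optional spaces/tabs around it.
// Two consecutive newlines are deliberately not accepted.
var DATE_TIME = /(\d{4}-\d{2}-\d{2})(?:[Tt]|[ \t\xA0]+|[ \t\xA0]*\r?\n[ \t\xA0]*)(\d{2}:\d{2}(?::\d{2})?)/;

// Returns the first date-and-time found in the text, or null if none.
function findDateTime(text) {
  var m = DATE_TIME.exec(text);
  return m ? { date: m[1], time: m[2] } : null;
}

var found = findDateTime("posted 2009-02-19\n\t 10:37:59 by kangax");
```

Form feeds, page headers and footers are left out; each would need its own rule, which is rather the point about underspecification.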

--
kangax

Dr J R Stockton

Feb 19, 2009, 1:51:00 PM

to

In comp.lang.javascript message
<lc2dneB-uvlK4QDUnZ2dnUVZ_iyWnZ2d@giganews.com>, Thu, 19 Feb 2009
10:37:59, kangax <kan...@gmail.com> posted:

>Dr J R Stockton wrote:
>> In comp.lang.javascript message
>> <k4ydnY53BoUhyAbUnZ2dnUVZ_umWnZ2d@giganews.com>, Tue, 17 Feb 2009
>> 19:26:36, kangax <kan...@gmail.com> posted:

>\u3000 is the last item in `WhiteSpace` array in that test:
>
>...
> {
> cp:"12288",codePoint:"0x3000", character :"\u3000",
> name:"IDEOGRAPHIC SPACE", group:"Zs"
> }
>...

Alas, I was trying to indicate a character that Richard did not test,
and missed seeing that one. IIRC, there are some characters in the full
list that might be thought to be space-like, but are not in the
whitespace list.

>>> Doesn't it make more sense to base tests on specs, rather than on some
>>> vague subset of browsers? We can't really assert that "NO browser
>>> recognizes" "MONGOLIAN VOWEL SEPARATOR"; neither can we test "all
>>> browsers", can we?
>> You missed the stress in "if ... NO browser".
>> Test fully against specs to find out whether the tested systems are
>> compliant. Test browsers covering most of the market for Windows
>> browsers to find out what most (Windows) users will have in their
>> browsers. The tests are quite distinct.
>
>What does Windows have to do with this?

It's a well-known operating system; substitute any other that can
support browsing.

> I do understand that the tests are distinct and that one might "catch"
>characters that the other wouldn't. Richard's can be described as
>"check \s compliance of a given browser as per ES3", while yours is
>"check which characters do (or do not) match \s in a given browser".
>
>I agree that there's no "right" test in this case. They simply have
>different intentions : )

Richard's test does not check that \s rejects \u6789, whatever that
might be; mine tests \s against all characters, and so includes those that
Richard tests. I can easily picture someone adding \u005F to the "list
in \s" as a debugging convenience and forgetting to remove it...
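
A rough sketch of that kind of exhaustive check follows; it is not the actual test page. `scanMatches`, `toEscape` and `findUnexpected` are hypothetical names, and the scan covers only the code units 0x0000 to 0xFFFF.

```javascript
// Tests every UTF-16 code unit against the pattern and returns the
// numeric codes of those that match.
function scanMatches(re) {
  var matched = [];
  for (var code = 0; code <= 0xFFFF; code++) {
    if (re.test(String.fromCharCode(code))) {
      matched.push(code);
    }
  }
  return matched;
}

// Formats a code as a \uXXXX escape for reporting.
function toEscape(code) {
  var hex = code.toString(16).toUpperCase();
  return "\\u" + "0000".slice(hex.length) + hex;
}

// Returns, as \uXXXX strings, the matched codes that are absent from
// the expected list.
function findUnexpected(matched, expected) {
  var seen = {};
  for (var i = 0; i < expected.length; i++) {
    seen[expected[i]] = true;
  }
  var extra = [];
  for (var j = 0; j < matched.length; j++) {
    if (!seen[matched[j]]) {
      extra.push(toEscape(matched[j]));
    }
  }
  return extra;
}

// Simulates the debugging slip: a "space" class that also accepts \u005F.
var extra = findUnexpected(scanMatches(/[ _]/), [0x0020]);
```

Here `extra` holds the one stray entry, which a list-driven test of expected characters would not have reported.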

>> ISO 8601:2004 allows a
>> space instead, without (AFAIR, ICBW) actually specifying \x20 or \xA0.

>Sounds like a case of underspecification.

It's not a perfect specification (few are); see
<URL:http://www.merlyn.demon.co.uk/u-8601-3.txt> and its predecessor 2.
IMHO, ECMA should have listed all the whitespace characters then in
Unicode.