nbsp in \s, <?ws> and <>

Juerd

Apr 15, 2005, 5:44:03 PM

to perl6-l...@perl.org

Is there a <?ws>-like thingy that is always \s+?

Do \s and <?ws> match non-breaking whitespace, U+00A0?

How about:

U+0008 backspace
U+00A0 no break space (Repeated for overview)
U+1361 ethiopic wordspace
U+2000 en quad
U+2001 em quad
U+2002 en space
U+2003 em space
U+2004 three per em space
U+2005 four per em space
U+2006 six per em space
U+2007 figure space
U+2008 punctuation space
U+2009 thin space
U+200A hair space
U+200B zero width space
U+202F narrow no break space
U+205F medium mathematic space
U+2060 word joiner (What is that, anyway?)
U+3000 ideographic space
U+FEFF zero width non-breaking space

\s is said (in S05) to match any unicode whitespace, but letting it
match NBSP and then using \s for splitting things is wrong, I think.

Are the contents of <> split using <?ws>? (Is <<$foo>>, where $foo is
"foo\xA0bar", one or two elements?)

Juerd
--
http://convolution.nl/maak_juerd_blij.html
http://convolution.nl/make_juerd_happy.html
http://convolution.nl/gajigu_juerd_n.html

Aaron Sherman

Apr 15, 2005, 6:20:12 PM

to Juerd, Perl6 Language List

On Fri, 2005-04-15 at 17:44, Juerd wrote:
> Is there a <?ws>-like thingy that is always \s+?

Not sure what that means exactly.

> Do \s and <?ws> match non-breaking whitespace, U+00A0?

As I understood, Perl 6 was going to use the Unicode standard(s) to
determine the whitespacishness of each codepoint. Going to Google, I
find:

http://www.fileformat.info/info/unicode/category/Zs/list.htm

which lists all of the "separator, space" characters.

> How about:
>
> U+0008 backspace
Character.isWhitespace() No

> U+00A0 no break space (Repeated for overview)
Character.isWhitespace() No

> U+1361 ethiopic wordspace
Character.isWhitespace() No

> U+2000 en quad
Character.isWhitespace() Yes

> U+2001 em quad
Character.isWhitespace() Yes

> U+2002 en space
Character.isWhitespace() Yes

> U+2003 em space
Character.isWhitespace() Yes

> U+2004 three per em space
Character.isWhitespace() Yes

> U+2005 four per em space
Character.isWhitespace() Yes

> U+2006 six per em space
Character.isWhitespace() Yes

> U+2007 figure space
Character.isWhitespace() No

> U+2008 punctuation space
Character.isWhitespace() Yes

> U+2009 thin space
Character.isWhitespace() Yes

> U+200A hair space
Character.isWhitespace() Yes

> U+200B zero width space
Character.isWhitespace() Yes

> U+202F narrow no break space
Character.isWhitespace() No

> U+205F medium mathematic space
Character.isWhitespace() Yes

> U+2060 word joiner (What is that, anyway?)
Character.isWhitespace() No
Comments: WJ; a zero width non-breaking space (only) intended for
disambiguation of functions for byte order mark

> U+3000 ideographic space
Character.isWhitespace() Yes

> U+FEFF zero width non-breaking space
Character.isWhitespace() No

> \s is said (in S05) to match any unicode whitespace, but letting it
> match NBSP and then using \s for splitting things is wrong, I think.

Thankfully, NBSP (U+00A0) is not Unicode whitespace.
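
A note on the Yes/No column above: Character.isWhitespace() looks like Java's method of that name, and Java documents it as excluding the non-breaking spaces U+00A0, U+2007 and U+202F even though they are space separators. The following is a minimal Python sketch of that rule, shown next to Python's own str.isspace(), which does count NBSP. The helper name java_like_is_whitespace is hypothetical, and the assumption that the column comes from Java is mine, not something stated in the thread.

```python
import unicodedata

# Non-breaking spaces that Java's documented rule excludes explicitly.
NON_BREAKING = {"\u00A0", "\u2007", "\u202F"}

# Control characters that Java's documented rule accepts:
# U+0009..U+000D (TAB..CR) and U+001C..U+001F (FS..US).
CONTROL_WS = {chr(c) for c in (*range(0x09, 0x0E), *range(0x1C, 0x20))}


def java_like_is_whitespace(ch):
    """Approximate Java's Character.isWhitespace() for one character (assumed rule)."""
    if ch in CONTROL_WS:
        return True
    is_separator = unicodedata.category(ch) in {"Zs", "Zl", "Zp"}
    return is_separator and ch not in NON_BREAKING


if __name__ == "__main__":
    for cp in (0x0008, 0x00A0, 0x2003, 0x2007, 0x202F, 0x3000):
        ch = chr(cp)
        print(f"U+{cp:04X} java-like={java_like_is_whitespace(ch)} "
              f"isspace={ch.isspace()}")
```

Under this reading, a "No" for U+00A0 says that one library's predicate skips it, not that the Unicode database lacks a space classification for it.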

--
Aaron Sherman <a...@ajs.com>
Senior Systems Engineer and Toolsmith
"It's the sound of a satellite saying, 'get me down!'" -Shriekback

Juerd

Apr 15, 2005, 6:24:48 PM

to Aaron Sherman, Perl6 Language List

Aaron Sherman skribis 2005-04-15 18:20 (-0400):

> > Is there a <?ws>-like thingy that is always \s+?
> Not sure what that means exactly.

<?ws> is \s* or \s+, depending on its surroundings.

> Thankfully, NBSP (U+00A0) is not Unicode whitespace.

Thanks for sharing this information!
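
The "\s* or \s+, depending on its surroundings" behaviour can be sketched with an ordinary regex. This sketch assumes the rule is "whitespace is mandatory between two word characters and optional everywhere else"; that reading is an assumption for illustration, not a definition quoted from S05, and the name WS is hypothetical.

```python
import re

# Assumed rule: refuse to match at a position that sits between two word
# characters, otherwise match any amount of whitespace (including none).
# Between two word characters this forces at least one whitespace
# character to be present, i.e. it behaves like \s+ there and \s* elsewhere.
WS = r"(?!(?<=\w)(?=\w))\s*"

two_words = re.compile("foo" + WS + "bar")
with_operator = re.compile("foo" + WS + r"\+" + WS + "bar")

if __name__ == "__main__":
    for text in ("foo bar", "foobar"):
        print(repr(text), bool(two_words.fullmatch(text)))
    for text in ("foo+bar", "foo + bar"):
        print(repr(text), bool(with_operator.fullmatch(text)))
```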

Larry Wall

Apr 15, 2005, 6:38:54 PM

to perl6-l...@perl.org

On Fri, Apr 15, 2005 at 11:44:03PM +0200, Juerd wrote:
: Is there a <?ws>-like thingy that is always \s+?

Not currently, since \s+ is there. <?ws> used to be that, but
currently is defined as the magical whitespace matcher used by :words.

: Do \s and <?ws> match non-breaking whitespace, U+00A0?

Yes.

: How about:
:
: U+0008 backspace
: U+00A0 no break space (Repeated for overview)
: U+1361 ethiopic wordspace
: U+2000 en quad
: U+2001 em quad
: U+2002 en space
: U+2003 em space
: U+2004 three per em space
: U+2005 four per em space
: U+2006 six per em space
: U+2007 figure space
: U+2008 punctuation space
: U+2009 thin space
: U+200A hair space
: U+200B zero width space
: U+202F narrow no break space
: U+205F medium mathematic space
: U+2060 word joiner (What is that, anyway?)
: U+3000 ideographic space
: U+FEFF zero width non-breaking space

Yes, any Unicode whitespace, but you seem to have a different list than
I do. Outside of the standard ASCIIish control-character whitespace,
I count only the \pZ characters, not the \pC characters, so I don't have
to tell you what a word-joiner is, since it's a \p[Cf] character. :-)

I will also gleefully ignore the existence of BOMs.

So I make it:

0020;SPACE;Zs;0;WS;;;;;N;;;;;
00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;;;;
180E;MONGOLIAN VOWEL SEPARATOR;Zs;0;WS;;;;;N;;;;;
2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;
2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;
2002;EN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2003;EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2004;THREE-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2005;FOUR-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2006;SIX-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
2008;PUNCTUATION SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2009;THIN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
200B;ZERO WIDTH SPACE;Zs;0;BN;;;;;N;;;;;
2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;
2029;PARAGRAPH SEPARATOR;Zp;0;B;;;;;N;;;;;
202F;NARROW NO-BREAK SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;
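
A list like the one above can be regenerated from whatever Unicode database is at hand. The following is a minimal Python sketch using the standard unicodedata module; the function name separators is hypothetical. The output depends on the Unicode version bundled with the interpreter, so it may not match the 2005 data shown above line for line.

```python
import sys
import unicodedata


def separators():
    """Return (code point, category, name) for every \\pZ character (Zs, Zl, Zp)."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        category = unicodedata.category(ch)
        if category.startswith("Z"):
            found.append((cp, category, unicodedata.name(ch, "<unnamed>")))
    return found


if __name__ == "__main__":
    for cp, category, name in separators():
        print(f"{cp:04X};{name};{category}")
    # The word joiner and the BOM are format characters, not separators.
    for cp in (0x2060, 0xFEFF):
        print(f"{cp:04X} is {unicodedata.category(chr(cp))}")
```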

: \s is said (in S05) to match any unicode whitespace, but letting it
: match NBSP and then using \s for splitting things is wrong, I think.

Perhaps the default word split should not be based on \s then.
It's just one more difference, in addition to trimming leading and
trailing whitespace like awk.

: Are the contents of <> split using <?ws>? (Is <<$foo>>, where $foo is
: "foo\xA0bar", one or two elements?)

That is using the default word splitter (or it *is* the default word
splitter), so if the default word split is based on <+[\s]-[\xA0]>
it would be one element.
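
For comparison, here is a rough Python analogue of a splitter based on <+[\s]-[\xA0]>. It is only a sketch: Python's \s is not guaranteed to cover the same set as Perl 6's, and the names BREAKING_WS and split_words are hypothetical. The character class [^\S\xA0] reads "not non-whitespace and not NBSP", which is whitespace minus U+00A0.

```python
import re

# Whitespace minus NO-BREAK SPACE: a Python stand-in for <+[\s]-[\xA0]>.
BREAKING_WS = re.compile(r"[^\S\xA0]+")


def split_words(text):
    """Split on breaking whitespace only, dropping leading and trailing empties."""
    return [word for word in BREAKING_WS.split(text) if word]


if __name__ == "__main__":
    sample = "  foo\xA0bar  baz\n"
    print(split_words(sample))      # NBSP stays inside the first word
    print(re.split(r"\s+", sample.strip()))  # plain \s also splits at the NBSP
```

Dropping the empty strings mirrors the awk-like trimming of leading and trailing whitespace mentioned above.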

Of course, the ZERO WIDTH SPACE is a nasty critter for anyone using
whitespace to separate tokens. That and maybe thin spaces probably
merit warnings in Perl code where they might cause visual ambiguity.

Larry

Juerd

Apr 15, 2005, 6:46:47 PM

to perl6-l...@perl.org

Larry Wall skribis 2005-04-15 15:38 (-0700):

> : Do \s and <?ws> match non-breaking whitespace, U+00A0?
> Yes.

That makes \s+ and \s*, and thus <?ws> very useless for anything but
trimming whitespace. For splitting (including word wrapping), it'd do
exactly the wrong thing.

> : \s is said (in S05) to match any unicode whitespace, but letting it
> : match NBSP and then using \s for splitting things is wrong, I think.
> Perhaps the default word split should not be based on \s then.

It'd have to.

Larry Wall

Apr 15, 2005, 6:56:14 PM

to perl6-l...@perl.org

On Sat, Apr 16, 2005 at 12:46:47AM +0200, Juerd wrote:
: Larry Wall skribis 2005-04-15 15:38 (-0700):
: > : Do \s and <?ws> match non-breaking whitespace, U+00A0?
: > Yes.
:
: That makes \s+ and \s*, and thus <?ws> very useless for anything but
: trimming whitespace. For splitting (including word wrapping), it'd do
: exactly the wrong thing.

Maybe we just need a <bws> for breaking white space, or some such.
<?ws> is primarily used in pattern matching with :w, where a
non-breaking space in the input would presumably be matched by a
non-breaking space in the pattern, or maybe an explicit <nbsp>.
As long as patterns (with or without :w) treat non-breaking spaces
as ordinary matching characters, it should work out, methinks.
Though it's probably a hair more readable to use an explicit <nbsp>...

Larry

Mark Reed

Apr 15, 2005, 7:13:18 PM

to Larry Wall

I thought we had just established that nbsp is not in Unicode's definition
of whitespace. So why should \s match it?