Do \s and <?ws> match non-breaking whitespace, U+00A0?
How about:
U+0008 backspace
U+00A0 no break space (Repeated for overview)
U+1361 ethiopic wordspace
U+2000 en quad
U+2001 em quad
U+2002 en space
U+2003 em space
U+2004 three per em space
U+2005 four per em space
U+2006 six per em space
U+2007 figure space
U+2008 punctuation space
U+2009 thin space
U+200A hair space
U+200B zero width space
U+202F narrow no break space
U+205F medium mathematical space
U+2060 word joiner (What is that, anyway?)
U+3000 ideographic space
U+FEFF zero width non-breaking space
\s is said (in S05) to match any Unicode whitespace, but letting it
match NBSP and then using \s for splitting things is wrong, I think.
Are the contents of <> split using <?ws>? (Is <<$foo>>, where $foo is
"foo\xA0bar", one or two elements?)
Juerd
Not sure what that means exactly.
> Do \s and <?ws> match non-breaking whitespace, U+00A0?
As I understood, Perl 6 was going to use the Unicode standard(s) to
determine the whitespacishness of each codepoint. Going to Google, I
find:
http://www.fileformat.info/info/unicode/category/Zs/list.htm
which lists all of the "separator, space" characters.
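For anyone who wants to check the list without the web page, here is a rough sketch in Python (not Perl 6) using the standard unicodedata module. The names CANDIDATES, general_category and space_separators are made up for illustration, and the answers depend on the Unicode version bundled with the interpreter, which is newer than the tables this thread refers to.

```python
import unicodedata

# Code points from the question above.
CANDIDATES = [
    0x0008, 0x00A0, 0x1361, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004,
    0x2005, 0x2006, 0x2007, 0x2008, 0x2009, 0x200A, 0x200B, 0x202F,
    0x205F, 0x2060, 0x3000, 0xFEFF,
]

def general_category(cp: int) -> str:
    """Return the two-letter Unicode general category of a code point."""
    return unicodedata.category(chr(cp))

def space_separators(cps):
    """Keep only the code points whose general category is Zs."""
    return [cp for cp in cps if general_category(cp) == "Zs"]

if __name__ == "__main__":
    for cp in CANDIDATES:
        # Control characters have no name, so supply a placeholder.
        name = unicodedata.name(chr(cp), "<unnamed>")
        print(f"U+{cp:04X} {general_category(cp)} {name}")
```

Note that this reports the general category only; it says nothing about whether a given \s implementation chooses to match the character.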
> How about:
>
> U+0008 backspace
Character.isWhitespace() No
> U+00A0 no break space (Repeated for overview)
Character.isWhitespace() No
> U+1361 ethiopic wordspace
Character.isWhitespace() No
> U+2000 en quad
Character.isWhitespace() Yes
> U+2001 em quad
Character.isWhitespace() Yes
> U+2002 en space
Character.isWhitespace() Yes
> U+2003 em space
Character.isWhitespace() Yes
> U+2004 three per em space
Character.isWhitespace() Yes
> U+2005 four per em space
Character.isWhitespace() Yes
> U+2006 six per em space
Character.isWhitespace() Yes
> U+2007 figure space
Character.isWhitespace() No
> U+2008 punctuation space
Character.isWhitespace() Yes
> U+2009 thin space
Character.isWhitespace() Yes
> U+200A hair space
Character.isWhitespace() Yes
> U+200B zero width space
Character.isWhitespace() Yes
> U+202F narrow no break space
Character.isWhitespace() No
> U+205F medium mathematical space
Character.isWhitespace() Yes
> U+2060 word joiner (What is that, anyway?)
Character.isWhitespace() No
Comments: WJ; a zero width non-breaking space (only), intended for
disambiguation of functions for byte order mark
> U+3000 ideographic space
Character.isWhitespace() Yes
> U+FEFF zero width non-breaking space
Character.isWhitespace() No
> \s is said (in S05) to match any Unicode whitespace, but letting it
> match NBSP and then using \s for splitting things is wrong, I think.
Thankfully, NBSP (U+00A0) is not Unicode whitespace.
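A note on where the No answers for U+00A0, U+2007 and U+202F come from: as documented, Java's Character.isWhitespace() accepts the Zs, Zl and Zp categories but explicitly excludes those three no-break spaces, and additionally accepts a handful of control characters. So it is a narrower test than membership in the Zs category. Below is a rough Python emulation of that documented rule. It is a sketch, not the Java implementation, and java_like_is_whitespace is a made-up name. It also uses the interpreter's current Unicode data, in which U+200B is a format character (Cf) rather than Zs, so it answers No there where the table above says Yes.

```python
import unicodedata

# No-break spaces that the Java documentation excludes by rule.
NO_BREAK_SPACES = {"\u00A0", "\u2007", "\u202F"}

# Control characters the Java documentation lists as whitespace:
# TAB, LF, VT, FF, CR and the separators U+001C..U+001F.
JAVA_CONTROL_WHITESPACE = set("\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f")

def java_like_is_whitespace(ch: str) -> bool:
    """Approximate Character.isWhitespace() for a single character."""
    if ch in JAVA_CONTROL_WHITESPACE:
        return True
    if ch in NO_BREAK_SPACES:
        return False
    # Space, line and paragraph separators count as whitespace.
    return unicodedata.category(ch) in ("Zs", "Zl", "Zp")

if __name__ == "__main__":
    for ch in ("\u00A0", "\u2000", "\u2007", "\u202F", "\u3000"):
        print(f"U+{ord(ch):04X}", unicodedata.category(ch),
              java_like_is_whitespace(ch))
```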
--
Aaron Sherman <a...@ajs.com>
Senior Systems Engineer and Toolsmith
"It's the sound of a satellite saying, 'get me down!'" -Shriekback
<?ws> is \s* or \s+, depending on its surroundings.
> Thankfully, NBSP (U+00A0) is not Unicode whitespace.
Thanks for sharing this information!
Not currently, since \s+ is there. <?ws> used to be that, but it is
currently defined as the magical whitespace matcher used by :words.
: Do \s and <?ws> match non-breaking whitespace, U+00A0?
Yes.
: How about:
:
: U+0008 backspace
: U+00A0 no break space (Repeated for overview)
: U+1361 ethiopic wordspace
: U+2000 en quad
: U+2001 em quad
: U+2002 en space
: U+2003 em space
: U+2004 three per em space
: U+2005 four per em space
: U+2006 six per em space
: U+2007 figure space
: U+2008 punctuation space
: U+2009 thin space
: U+200A hair space
: U+200B zero width space
: U+202F narrow no break space
: U+205F medium mathematical space
: U+2060 word joiner (What is that, anyway?)
: U+3000 ideographic space
: U+FEFF zero width non-breaking space
Yes, any Unicode whitespace, but you seem to have a different list than
I do. Outside of the standard ASCIIish control-character whitespace,
I count only the \pZ characters, not the \pC characters, so I don't have
to tell you what a word-joiner is, since it's a \p[Cf] character. :-)
I will also gleefully ignore the existence of BOMs.
So I make it:
0020;SPACE;Zs;0;WS;;;;;N;;;;;
00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;;;;
180E;MONGOLIAN VOWEL SEPARATOR;Zs;0;WS;;;;;N;;;;;
2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;
2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;
2002;EN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2003;EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2004;THREE-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2005;FOUR-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2006;SIX-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
2008;PUNCTUATION SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2009;THIN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
200B;ZERO WIDTH SPACE;Zs;0;BN;;;;;N;;;;;
2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;
2029;PARAGRAPH SEPARATOR;Zp;0;B;;;;;N;;;;;
202F;NARROW NO-BREAK SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;
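The records above are semicolon-separated UnicodeData.txt lines: field 0 is the code point, field 1 the name, field 2 the general category, field 4 the bidirectional class and field 5 the decomposition. A small Python sketch (not Perl 6) that pulls those fields out of a few of the lines quoted above shows that the data itself marks the no-break spaces. SAMPLE and parse_record are illustrative names only.

```python
# A few records copied from the list above.
SAMPLE = """\
00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
2009;THIN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;
"""

def parse_record(line: str) -> dict:
    """Split one UnicodeData.txt record into the fields used here."""
    fields = line.split(";")
    return {
        "codepoint": int(fields[0], 16),
        "name": fields[1],
        "category": fields[2],
        "bidi": fields[4],
        "decomposition": fields[5],
    }

records = [parse_record(line) for line in SAMPLE.splitlines()]

# Names of the records whose decomposition is tagged <noBreak>.
no_break = [r["name"] for r in records
            if r["decomposition"].startswith("<noBreak>")]

if __name__ == "__main__":
    print(no_break)
```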
: \s is said (in S05) to match any Unicode whitespace, but letting it
: match NBSP and then using \s for splitting things is wrong, I think.
Perhaps the default word split should not be based on \s then.
It's just one more difference, in addition to trimming leading and
trailing whitespace like awk.
: Are the contents of <> split using <?ws>? (Is <<$foo>>, where $foo is
: "foo\xA0bar", one or two elements?)
That is using the default word splitter (or it *is* the default word
splitter), so if the default word split is based on <+[\s]-[\xA0]>
it would be one element.
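To make the one-element versus two-element difference concrete, here is a Python analogue (not Perl 6) of splitting on \s versus splitting on <+[\s]-[\xA0]>. In Python's re module the class [^\S\xa0] means "whitespace, but not U+00A0". The names ANY_WS, WS_EXCEPT_NBSP and split_words are made up for this sketch, and Python's \s is only assumed to be close to what S05 intends.

```python
import re

# Any Unicode whitespace, NBSP included.
ANY_WS = re.compile(r"\s+")

# Whitespace minus U+00A0: "not (non-whitespace or NBSP)".
WS_EXCEPT_NBSP = re.compile(r"[^\S\xa0]+")

def split_words(text: str, splitter) -> list:
    """Split on the given pattern and drop empty leading/trailing fields."""
    return [word for word in splitter.split(text) if word]

if __name__ == "__main__":
    foo = "foo\xa0bar"
    print(len(split_words(foo, ANY_WS)))          # splitting on \s
    print(len(split_words(foo, WS_EXCEPT_NBSP)))  # splitting on \s minus NBSP
```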
Of course, the ZERO WIDTH SPACE is a nasty critter for anyone using
whitespace to separate tokens. That and maybe thin spaces probably
merit warnings in Perl code where they might cause visual ambiguity.
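The kind of warning meant here could be approximated by a simple scan. The following Python sketch (not Perl 6, and not anything a Perl 6 compiler is known to do) reports the line and column of each visually ambiguous space in a source string. The SUSPICIOUS set is an assumption: only ZERO WIDTH SPACE and thin spaces are named above, and HAIR SPACE is added here as a guess.

```python
# Characters flagged by this sketch. HAIR SPACE is an assumed addition.
SUSPICIOUS = {
    "\u200B": "ZERO WIDTH SPACE",
    "\u2009": "THIN SPACE",
    "\u200A": "HAIR SPACE",
}

def find_suspicious_spaces(source: str) -> list:
    """Return (line, column, name) for each flagged character, 1-based."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPICIOUS:
                hits.append((lineno, col, SUSPICIOUS[ch]))
    return hits

if __name__ == "__main__":
    code = "my $a\u200b = 1;\nsay\u2009$a;"
    for lineno, col, name in find_suspicious_spaces(code):
        print(f"line {lineno}, column {col}: {name}")
```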
Larry
That makes \s+ and \s*, and thus <?ws>, nearly useless for anything but
trimming whitespace. For splitting (including word wrapping), it would
do exactly the wrong thing.
> : \s is said (in S05) to match any Unicode whitespace, but letting it
> : match NBSP and then using \s for splitting things is wrong, I think.
> Perhaps the default word split should not be based on \s then.
It'd have to.
Maybe we just need a <bws> for breaking white space, or some such.
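One possible reading of a <bws> rule, sketched in Python (not Perl 6): treat a character as breaking whitespace if it is whitespace and its Unicode decomposition is not tagged <noBreak>. That excludes U+00A0, U+2007 and U+202F without hard-coding them. This is only a guess at what <bws> might mean; is_breaking_whitespace and split_breaking are hypothetical names.

```python
import unicodedata

def is_breaking_whitespace(ch: str) -> bool:
    """Whitespace whose decomposition is not tagged <noBreak>."""
    if not ch.isspace():
        return False
    return not unicodedata.decomposition(ch).startswith("<noBreak>")

def split_breaking(text: str) -> list:
    """Split text into words on runs of breaking whitespace only."""
    words, current = [], []
    for ch in text:
        if is_breaking_whitespace(ch):
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words

if __name__ == "__main__":
    print(split_breaking("10\u00a0km is far"))
```

A word wrapper built on split_breaking would keep "10" and "km" together, which is the behaviour a no-break space is supposed to give.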
<?ws> is primarily used in pattern matching with :w, where a
non-breaking space in the input would presumably be matched by a
non-breaking space in the pattern, or maybe an explicit <nbsp>.
As long as patterns (with or without :w) treat non-breaking spaces
as ordinary matching characters, it should work out, methinks.
Though it's probably a hair more readable to use an explicit <nbsp>...
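As a rough illustration of a non-breaking space being an ordinary matching character, here are two Python patterns (not Perl 6 rules, and without any analogue of :w). One spells out the no-break space explicitly. The other allows only breaking whitespace between the tokens. The pattern names and the hard-coded list of no-break spaces are assumptions made for this sketch.

```python
import re

# The no-break space is an ordinary literal in the pattern.
EXPLICIT_NBSP = re.compile("10\u00a0km")

# Whitespace other than the three no-break spaces.
BREAKING_ONLY = re.compile(r"10[^\S\xa0\u2007\u202f]+km")

def matches(pattern, text: str) -> bool:
    """True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None

if __name__ == "__main__":
    for text in ("about 10\u00a0km", "about 10 km"):
        print(repr(text),
              matches(EXPLICIT_NBSP, text),
              matches(BREAKING_ONLY, text))
```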
Larry