What should \s \w \d match in 5.12?

Karl Williamson

Sep 30, 2009, 1:22:57 PM
to Perl5 Porters, yves....@booking.com, Tom Christiansen
I had thought in our discussion last year that we had determined that
these should match only in the ASCII range. And so, I thought that when
Yves flipped the switch on the \p{Posix} matches, that these would
change as well, but that isn't the case:
perl -E "say chr(0x2028) =~ /\s/"
1

in blead.

If I'm wrong about the agreement, I would like to start another
discussion, and my initial position is that they should only match in
the ASCII range.

Zefram

Sep 30, 2009, 2:47:14 PM
to Perl5 Porters
karl williamson wrote:
>discussion, and my initial position is that they should only match in
>the ASCII range.

Yes please. The ASCII versions are very commonly required, deserving of
a shorthand, and currently lack any abbreviated form at all. Matching
extendable sets of Unicode characters is a much less common requirement,
and can already be expressed in explicitly-Unicode-based ways.

-zefram

Demerphq

Oct 1, 2009, 5:28:25 PM
to karl williamson, Perl5 Porters, yves....@booking.com, Tom Christiansen
2009/9/30 karl williamson <pub...@khwilliamson.com>:

> I had thought in our discussion last year that we had determined that these
> should match only in the ASCII range.  And so, I thought that when Yves
> flipped the switch on the \p{Posix} matches, that these would change as
> well, but that isn't the case:
>  perl -E "say chr(0x2028) =~ /\s/"
> 1
>
> in blead.

I'm inclined to say it just slipped by me. I'll poke it with a stick
when I get a chance.

> If I'm wrong about the agreement, I would like to start another discussion,
> and my initial position is that they should only match in the ASCII range.

Agreed.

2009/9/30 Zefram <zef...@fysh.org>:


> Yes please. The ASCII versions are very commonly required, deserving of
> a shorthand, and currently lack any abbreviated form at all. Matching
> extendable sets of Unicode characters is a much less common requirement,
> and can already be expressed in explicitly-Unicode-based ways.

Yes, I concur.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Karl Williamson

Oct 1, 2009, 9:09:17 PM
to demerphq, Perl5 Porters, yves....@booking.com, Tom Christiansen
demerphq wrote:
> 2009/9/30 karl williamson <pub...@khwilliamson.com>:
>> I had thought in our discussion last year that we had determined that these
>> should match only in the ASCII range. And so, I thought that when Yves
>> flipped the switch on the \p{Posix} matches, that these would change as
>> well, but that isn't the case:
>> perl -E "say chr(0x2028) =~ /\s/"
>> 1
>>
>> in blead.
>
> I'm inclined to say it just slipped by me. I'll poke it with a stick
> when I get a chance.
>
>> If I'm wrong about the agreement, I would like to start another discussion,
>> and my initial position is that they should only match in the ASCII range.
>
> Agreed.

Just to be precise about it, I neglected to mention that my statement
was meant only to apply in the absence of a "use locale", and whatever
the base C library routines do on an EBCDIC system. I wasn't advocating
changing the behavior under those circumstances.

Tatsuhiko Miyagawa

Oct 3, 2009, 5:56:27 PM
to karl williamson, demerphq, Perl5 Porters, yves....@booking.com, Tom Christiansen
I was looking at perl5110delta and was surprised (and a bit upset) to see
the \d \w \s changes mentioned:

I toyed with a small piece of code, and it seems not to be working as
specified in the delta anyway:
http://gist.github.com/200900

So apparently the delta is not correct, or the delta is trying to specify
what *will* be changed but has not been done yet?

Anyway, I have tons of scripts that rely on \d matching Japanese
numbers and \s matching full-width space etc. Being able to have a
pragma to enable/disable the new behavior would be very nice. (I
understand I can start rewriting those \d to something like \p{IsDigit}
to be forward compatible, though.)

On Thu, Oct 1, 2009 at 6:09 PM, karl williamson <pub...@khwilliamson.com> wrote:
> demerphq wrote:
>>
>> 2009/9/30 karl williamson <pub...@khwilliamson.com>:
>>>
>>> I had thought in our discussion last year that we had determined that
>>> these
>>> should match only in the ASCII range.  And so, I thought that when Yves
>>> flipped the switch on the \p{Posix} matches, that these would change as
>>> well, but that isn't the case:
>>>  perl -E "say chr(0x2028) =~ /\s/"
>>> 1
>>>
>>> in blead.
>>
>> I'm inclined to say it just slipped by me. I'll poke it with a stick
>> when I get a chance.
>>
>>> If I'm wrong about the agreement, I would like to start another
>>> discussion,
>>> and my initial position is that they should only match in the ASCII
>>> range.
>>
>> Agreed.
>
> Just to be precise about it, I neglected to mention that my statement was
> meant only to apply in the absence of a "use locale", and whatever the base
> C library routines do on an EBCDIC system.  I wasn't advocating changing the
> behavior under those circumstances.
>


--
Tatsuhiko Miyagawa

Karl Williamson

Oct 3, 2009, 9:33:46 PM
to Tatsuhiko Miyagawa, demerphq, Perl5 Porters, yves....@booking.com, Tom Christiansen
Tatsuhiko Miyagawa wrote:
> I was looking at perl5110delta and surprised (and a bit upset) to see
> the \d \w \s changes mentioned:
>
> I toyed with a small piece of code and seems it's not working as
> specified in delta anyway:
> http://gist.github.com/200900
>
> So apparently the delta is not correct, or delta is trying to specify
> what *will* be changed but not done yet?
>

Yes, the delta is not correct, but gives the current plan, so that
should be what happens.

> Anyway, I have tons of scripts that rely on \d matching Japanese
> numbers and \s matches with full-width space etc. Being able to have a
> pragma to enable/disable the new behavior would be very nice. (I
> understand I can start rewriting those \d to like \p{IsDigit} to be
> forward compatible, though)
>

Note that the 'Is' is optional. The chart in the delta gives the
mappings for \s and \w as well. Also note that if you can accept a
vertical tab in \s, \p{Space} is shorter.
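
As a rough illustration of the spelled-out properties mentioned here (a sketch only, not taken from the delta; the sample characters and variable names are chosen for this example), both spellings of the digit property accept a full-width digit, and \p{Space} accepts the vertical tab as well as the full-width space:

```perl
use strict;
use warnings;

# Sample characters, written as code points so the example does not
# depend on the encoding of the source file.
my $fullwidth_one   = "\x{FF11}";   # FULLWIDTH DIGIT ONE
my $fullwidth_space = "\x{3000}";   # IDEOGRAPHIC SPACE
my $vertical_tab    = "\x0B";       # LINE TABULATION

my %result = (
    'IsDigit spelling' => $fullwidth_one   =~ /\A\p{IsDigit}\z/ ? 1 : 0,
    'Digit spelling'   => $fullwidth_one   =~ /\A\p{Digit}\z/   ? 1 : 0,
    'wide space'       => $fullwidth_space =~ /\A\p{Space}\z/   ? 1 : 0,
    'vertical tab'     => $vertical_tab    =~ /\A\p{Space}\z/   ? 1 : 0,
);

print "$_: $result{$_}\n" for sort keys %result;
```

Because the properties are named explicitly, these matches do not depend on what \d and \s end up meaning.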

There are plans for a pragma for other unicode incompatibilities, and a
git branch that includes the beginnings of one: "use legacy". I had
thought that these changes could be controlled by a pragma, and I hope
that it is this one.

Jesse

Oct 3, 2009, 9:43:25 PM
to karl williamson, Tatsuhiko Miyagawa, demerphq, Perl5 Porters, yves....@booking.com, Tom Christiansen


> There are plans for a pragma for other unicode incompatibilities, and a
> git branch that includes the beginnings of one: "use legacy". I had
> thought that these changes could be controlled by a pragma, and I hope
> that it is this one.

If the changes will be controlled by a pragma, what's the point of
forcing existing code to 'use legacy' rather than making these changes
part of 'use 5.12'?

We've always had a strong culture of not gratuitously breaking backwards
compatibility. This seems like a strange thing to choose to throw that
away on.

-Jesse

Tatsuhiko Miyagawa

Oct 4, 2009, 12:37:07 AM
to jesse, karl williamson, demerphq, Perl5 Porters, yves....@booking.com, Tom Christiansen
On Sat, Oct 3, 2009 at 6:43 PM, jesse <je...@fsck.com> wrote:

> If the changes will be controlled by a pragma, what's the point of
> forcing existing code to 'use legacy' rather than making these changes
> part of 'use 5.12'?

+1

--
Tatsuhiko Miyagawa

Karl Williamson

Oct 4, 2009, 12:34:12 AM
to jesse, Tatsuhiko Miyagawa, demerphq, Perl5 Porters, yves....@booking.com, Tom Christiansen

There has been plenty of discussion about this over the years. The
simple explanation is that the current scheme is broken in various ways.
In the things I'm working on fixing, the breakage is essentially that
the internal storage detail (utf8 or not) of strings changes their
external semantics. This has gone on for a long time, and it leads to
all sorts of unexpected results, for example when Perl decides for any
number of reasons to change the storage type of a string. However,
whenever any product is in the field long enough, people come to rely on
its bugs. So, we are planning to add a pragma for those relatively few
who rely on the broken behavior. For most, programs will actually work
more correctly.

For the \d etc. changes, there are a number of arguments for changing
their behavior. The one I can think of right now is that the current
behavior can be a security threat: most Perl programs out there are not
expecting Unicode at all, so having, e.g., \d match not just 10
things but 411, or \w match not just 63 things but 101,685, can lead to
lots of unintended consequences. It seems better that the program has
to explicitly indicate that it is prepared to handle these expanded cases.
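
A minimal sketch of the concern, assuming a program that validates input with \d. The two validator names are hypothetical and exist only for this example. A string of Thai digits passes the \d check but not the explicit ASCII range:

```perl
use strict;
use warnings;

# Hypothetical validators, named here only for illustration.
sub looks_numeric_loose  { return $_[0] =~ /\A\d+\z/    ? 1 : 0 }
sub looks_numeric_strict { return $_[0] =~ /\A[0-9]+\z/ ? 1 : 0 }

my $thai = "\x{0E51}\x{0E52}\x{0E53}";   # THAI DIGIT ONE, TWO, THREE

print "loose  accepts Thai digits: ", looks_numeric_loose($thai),    "\n";
print "strict accepts Thai digits: ", looks_numeric_strict($thai),   "\n";
print "strict accepts '123':       ", looks_numeric_strict("123"),   "\n";
```

Code that later hands the "validated" string to something expecting ASCII digits would be surprised by the first result.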

Tatsuhiko Miyagawa

Oct 4, 2009, 12:45:04 AM
to karl williamson, jesse, demerphq, Perl5 Porters, yves....@booking.com, Tom Christiansen
On Sat, Oct 3, 2009 at 9:34 PM, karl williamson <pub...@khwilliamson.com> wrote:

> There has been plenty of discussion about this over the years.  The simple
> explanation is that the current scheme is broken in various ways.  In the
> things I'm working on fixing, the breakage is essentially that the internal
> storage detail (utf8 or not) of strings changes their external semantics.

Yes, that has been really annoying and I appreciate your efforts to fix that.

> For the \d, etc things, there are a number of arguments for changing their
> behavior.

Yes :)

> The one I can think of right now, is that it currently can be a
> security threat, in that most perl programs out there are not expecting
> Unicode at all, and so having, eg., \d  match not just 10 things but 411, or
> \w match not just 63 things but 101,685 can lead to lots of unintended
> consequences.  It seems better that the program has to explicitly indicate
> that it is prepared to handle these expanded cases.

I understand what your policy is about this, but seeing expressions
like "most perl programs" and "(The ASCII versions are) very commonly
required" (earlier in this thread) kind of upsets me, because
that's not what I expect in most modern Perl programs I write, both
personally and at work.

--
Tatsuhiko Miyagawa

Demerphq

Oct 4, 2009, 3:48:40 AM
to jesse, karl williamson, Tatsuhiko Miyagawa, Perl5 Porters, yves....@booking.com, Tom Christiansen
2009/10/4 jesse <je...@fsck.com>:

Let me outline the problem here a little.

With the schizo behaviour of Perl string semantics especially with
regard to the regex engine there is no way to fix some of the problems
without introducing breakage *somewhere*.

An example, for instance, is the behaviour of \d, or of [[:alpha:]],
which match different things "in unicode mode" than they do in
"non-unicode mode".

This completely breaks the charclass logic resulting in POSIX
charclasses and their negations under unicode matching the same
characters(!!!!!), amongst other intriguing bugs.

There is no way to fix these problems without changing the defined
behaviour of these constructs.

There are many, many problems like this. For instance the *hell* that
\xDF causes because it matches 'ss' in unicode case insensitively, and
doesn't match anything in non-unicode.

So the plan is to fix this stuff so it is consistent, and deal with
any incidental breakage as we can.

With regard to backwards compatibility, I actually have NO plans to
introduce EITHER pragma OR feature flags to enable the old behaviour.
The old behaviour is buggy, broken and internally inconsistent. We do
NOT provide flags to re-enable old fixed bugs for anything else, why
should we do it with the regex engine?

Now, before the steam starts flowing from your ears, I'll let you in on
a little secret:

The user community can do this itself.

You see there has been support for overriding the content of a Regex
pattern prior to regex compilation for a very long time. Using this
infrastructure one can define drop in modules that override \d and \s
and whatever it is that people want to override, with *anything* they
want. So IMO this is a non-problem. People that really want \d to mean
\p{IsDigit} can just define a regex pattern filter to munge \d into
\p{IsDigit} and get the *sane* and predictable results they wanted.

Now if it turns out that what I describe above is impossible then I
will reconsider the subject of including legacy support for this in
the regex engine. However it has to be really, really amazingly
impossible for me to go there.

I'd just like to repeat a point here. WE CAN'T FIX THIS STUFF WITHOUT
BREAKING SOMETHING.

So we can either leave it broken forever, or we can take the hit
sometime to deal with the underlying conceptual breakage, and that is
what I believe the plan is/was for 5.12.

demerphq

Oct 4, 2009, 4:00:32 AM
to Tatsuhiko Miyagawa, karl williamson, jesse, Perl5 Porters, yves....@booking.com, Tom Christiansen
2009/10/4 Tatsuhiko Miyagawa <miya...@gmail.com>:

I have to admit I worried about this a bit, but came to the conclusion
that likely

a) more people get bitten by \d including things it shouldn't than
b) people like you who really want \d to mean \p{IsDigit}.

I do regret that it might impact you personally, and hope that we can
get some drop in regex compilation filters in place so that you can do
something like:

use re::UnicodeShortForms;

and have \d mean \p{IsDigit}

However I feel very strongly that somebody has to take it in the neck
to get this fixed, and so while I sympathise with anyone negatively
impacted, I can't really do more. If it's not you it will be someone
else, regarding something else.

The status quo can't be fixed without something giving, and on the balance
of things the area that impacts you the most seems like the area likely
to impact the least number of people. Maybe we will hear more noise to
the contrary, in which case my view might change, but right now I don't
see a feasible path forward without changing this area of things.

My humble apologies for potentially ruining your day. I'd be happy to
work with you to assist in coming up with a reasonable workaround.
Please DO keep the feedback, even negative, coming; we may have missed
more serious breakage that we CAN resolve in a backwards compatible
way.

cheers,

Joshua ben Jore

Oct 4, 2009, 11:32:30 AM
to demerphq, Tatsuhiko Miyagawa, karl williamson, jesse, Perl5 Porters, yves....@booking.com, Tom Christiansen

I've never seriously used this feature before but I did once pen a
lexical \w replacement to mean [A-Za-z'.-] because I was matching lots
of names. Below is Yves' suggestion for a community-solved problem.
Ought to work back to 5.6 too. Of course, this could work the *other*
way too. This turns \d => \p{IsDigit} but could also turn it into [0-9].

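package re::UnicodeShortForms;
use 5.006;
use strict;
use warnings;
use overload;

# Map from the escape letter to the text that replaces "\<letter>".
our %REPLACEMENTS = (
    d => '\p{IsDigit}',
    # w => ...
);

# Install the rewriter for regex constants in the caller's lexical scope.
sub import { overload::constant( qr => \&rewrite_regexp ) }

sub rewrite_regexp {
    my ( undef, $text ) = @_;
    # Visit every backslash escape; swap the ones listed above and
    # leave all others exactly as written.
    $text =~ s{
        \\
        ( d | . )
    }{
        $REPLACEMENTS{$1} || "\\$1"
    }xge;
    return $text;
}

1;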
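
The "other way" can be sketched in a single file as follows. This is an illustration only: the package name My::AsciiDigits is made up for the example, the module is defined inline rather than in its own .pm file, and the rewrite is deliberately naive (it would mangle \d inside a bracketed character class).

```perl
use strict;
use warnings;

BEGIN {
    package My::AsciiDigits;    # hypothetical name, for illustration only
    use overload ();

    # Called by "use My::AsciiDigits"; installs the rewriter for regex
    # constants in the lexical scope of the caller.
    sub import { overload::constant( qr => \&rewrite ) }

    sub rewrite {
        my ( undef, $text ) = @_;
        # Turn \d into [0-9]; leave every other escape untouched.
        $text =~ s{ \\ (.) }{ $1 eq 'd' ? '[0-9]' : "\\$1" }xge;
        return $text;
    }

    # Mark the module as loaded so "use" does not look for a file.
    $INC{'My/AsciiDigits.pm'} = 1;
}

use My::AsciiDigits;

my $thai_seven    = "\x{0E57}";                      # THAI DIGIT SEVEN
my $thai_matches  = $thai_seven =~ /\A\d\z/ ? 1 : 0; # pattern is rewritten
my $ascii_matches = "7"         =~ /\A\d\z/ ? 1 : 0;

print "Thai digit matches \\d:  $thai_matches\n";
print "ASCII digit matches \\d: $ascii_matches\n";
```

With the rewriter in scope, the Thai digit should no longer match, while the ASCII digit still does.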

'Josh'

Abigail

Oct 4, 2009, 5:20:32 PM
to Tatsuhiko Miyagawa, karl williamson, jesse, demerphq, Perl5 Porters, yves....@booking.com, Tom Christiansen
On Sat, Oct 03, 2009 at 09:45:04PM -0700, Tatsuhiko Miyagawa wrote:
> On Sat, Oct 3, 2009 at 9:34 PM, karl williamson <pub...@khwilliamson.com> wrote:
>
> > There has been plenty of discussion about this over the years.  The simple
> > explanation is that the current scheme is broken in various ways.  In the
> > things I'm working on fixing, the breakage is essentially that the internal
> > storage detail (utf8 or not) of strings changes their external semantics.
>
> Yes, that has been really annoying and I appreciate your efforts to fix that.
>
> > For the \d, etc things, there are a number of arguments for changing their
> > behavior.
>
> Yes :)
>
> > The one I can think of right now, is that it currently can be a
> > security threat, in that most perl programs out there are not expecting
> > Unicode at all, and so having, eg., \d  match not just 10 things but 411, or
> > \w match not just 63 things but 101,685 can lead to lots of unintended
> > consequences.  It seems better that the program has to explicitly indicate
> > that it is prepared to handle these expanded cases.
>
> I understand what your policy is about this, but I see expressions
> like "most perl programs" and "(The ASCII versions are) very commonly
> required" (earlier in this thread) that kind of upsets me, because
> that's not what I expect in most modern perl programs I write both
> personally and at work.


For \d to mean "any digit" or for \d to mean [0-9] is both reasonable.
As long as \d remains on its own. The problem starts as soon as people
write something like /1\d/.

Surely they aren't expecting to match a 1 followed by a Thai 7.

Even a /\d+/ will often match more than people want, as it can happily
match a string of digits of different scripts. The fact that

if (/(\d+)/) {
$num += $1;
}

can lead to warnings and unexpected behaviour makes, IMO, \d rather useless.

Personally, I haven't used \d in many years. Not only does it match too much,
it will also match different characters depending on the Perl version. And
it matches "digits" that Perl doesn't know how to use in an arithmetic
expression.
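
A small sketch of that failure mode, using a made-up input string of an ASCII 1 followed by a Thai 7. The warning handler is only there so the example can count warnings instead of printing them:

```perl
use strict;
use warnings;

my $mixed = "1\x{0E57}";    # ASCII 1 followed by THAI DIGIT SEVEN

# Collect warnings rather than letting them go to STDERR.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, @_ };

my ( $num, $matched ) = ( 0, '' );
if ( $mixed =~ /(\d+)/ ) {
    $matched = $1;
    $num += $matched;       # only the leading ASCII digit is understood
}

print "characters matched by \\d+: ", length($matched), "\n";
print "value added to \$num:       $num\n";
print "warnings raised:           ", scalar(@warnings), "\n";
```

Both characters are matched as digits, yet the arithmetic only sees the leading 1 and complains about the rest.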

I do use \w instead of [a-zA-Z_0-9] which is normally what I want to match -
but that's just me taking shortcuts; [a-zA-Z_0-9] is a bit long to type.


I would personally favour if \d becomes just a shortcut for [0-9], and \w a
shortcut for [a-zA-Z_0-9]. All the time. Regardless of internal encoding
of the subject string, the locale, or how the regexp itself is encoded.

Not until then will I stop recommending people to not use \d or \w.

Abigail

John

Oct 4, 2009, 6:18:17 PM
to Perl5 Porters
John wrote:
>
>> I would personally favour if \d becomes just a shortcut for [0-9],
>> and \w a
>> shortcut for [a-zA-Z_0-9]. All the time. Regardless of internal encoding
>> of the subject string, the locale, or how the regexp itself is encoded.
>>
>> Not until then will I stop recommending people to not use \d or \w.
>>
>>
>>
>> Abigail
>>
>
> Hi all,
> I would like to throw my two bits into the mix as a user, rather than
> a developer of Perl.
>
> I think if you want the unicode semantics and really want to match all
> unicode digits you should be forced to write \p{Digit}
>
> \d should, as Abigail states, match only [0-9]
>
> This should also hold for other shortcuts with pre-unicode definitions.
>
> This does not help out with locales, for which I propose a new set of
> regex properties \C{Property Name}.
>
> I picked 'C' as C tends to be the default locale
>
> So /\Cw/ or /\C{Word}/ would match a word character in the current
> locale.
>
> John
>

Addendum.

C is already used. Argh!

So how about Z :-)


John

Oct 4, 2009, 6:15:06 PM
to Perl5 Porters

> I would personally favour if \d becomes just a shortcut for [0-9], and \w a
> shortcut for [a-zA-Z_0-9]. All the time. Regardless of internal encoding
> of the subject string, the locale, or how the regexp itself is encoded.
>
> Not until then will I stop recommending people to not use \d or \w.
>
>
>
> Abigail
>

Hi all,


I would like to throw my two bits into the mix as a user, rather than a
developer of Perl.

I think if you want the unicode semantics and really want to match all
unicode digits you should be forced to write \p{Digit}

\d should, as Abigail states, match only [0-9]

This should also hold for other shortcuts with pre-unicode definitions.

This does not help out with locales, for which I propose a new set of
regex properties \C{Property Name}.

I picked 'C' as C tends to be the default locale

So /\Cw/ or /\C{Word}/ would match a word character in the current locale.

John


Eric Brine

Oct 4, 2009, 7:28:54 PM
to Abigail, Tatsuhiko Miyagawa, karl williamson, jesse, demerphq, Perl5 Porters, yves....@booking.com, Tom Christiansen
On Sun, Oct 4, 2009 at 5:20 PM, Abigail <abi...@booking.com> wrote:

> I would personally favour if \d becomes just a shortcut for [0-9], and \w a
> shortcut for [a-zA-Z_0-9]. All the time. Regardless of internal encoding
> of the subject string, the locale, or how the regexp itself is encoded.
>

I noticed you didn't touch \s, which is the one that troubles me (too?). I
often use \d and \w in patterns that are captured. It's good to match
tightly, so I agree with you. \s, on the other hand, matches parts of the
input I usually wish to discard. Having it behave laxly (i.e. match
characters such as NBSP) would benefit me.

- ELB

Aristotle Pagaltzis

Oct 4, 2009, 11:11:16 PM
to perl5-...@perl.org
* Eric Brine <ike...@adaelis.com> [2009-10-05 01:30]:

> I noticed you didn't touch \s, which is the one that troubles
> me (too?). I often use \d and \w in patterns that are captured.
> It's good to match tightly, so I agree with you. \s, on the
> other hand, matches parts of the input I usually wish to
> discard. Having it behave laxly (i.e. match characters such as
> NBSP) would benefit me.

That’s a sore point for me. Even the fact that \s matches newline
often annoys me. I wish there was a shorthand for [ \t] which is
what I usually want when I use \s – though I often use \s anyway
for the brevity when it’s not a huge issue.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

Mark Mielke

Oct 5, 2009, 12:13:06 AM
to perl5-...@perl.org

+1

\s only does what I mean some of the time today. Adding extra meaning to
it just means more places that it doesn't quite work. A lot of code does
line-based parsing first (even if just using while (<>)), and then uses
\s liberally as a "any white space except for newline." If \s starts
matching other kinds of newlines - this code will be more broken than it
is today.

I've mostly given up on \s. I frequently use [ \t] instead as well.

For \d, I would expect that in every situation where /\A(\d+)\z/
matches, (0+$1 == 0+$_) holds. I have not followed this thread very
well - what does Perl do with (0+$_) if it encounters a string with
unicode numbers?

For \w, this is used very strictly in lexers/parsers. I frequently use
it to match exactly 'A' - 'Z', 'a' - 'z', '0' - '9', and '_' (ASCII)
before passing the argument to an external program. If it starts passing
additional characters through, I can think of several external
applications that *will* break, because they don't understand unicode.

For \s, it's a bit unusable at present, and changing the definition will
create more confusion and breakage than gain.

Changing \s, \w, and \d from their traditional meanings sounds dangerous.

My opinion.

What is the real gain here? That some applications will magically start
supporting additional unicode sequences and "just work"? That people can
type fewer regexp operands to get "new"-style behaviour? How many people
want this?

I suppose if it is "all non-English writers" my opinion might be
out-numbered. :-)

Cheers,
mark

--
Mark Mielke <ma...@mielke.cc>

Tom Christiansen

Oct 5, 2009, 12:32:23 AM
to Eric Brine, Abigail, Tatsuhiko Miyagawa, karl williamson, jesse, demerphq, Perl5 Porters, yves....@booking.com

As \s is currently [\h\v], perhaps you'd like "horizontal space" via \h:

U+0009 CHARACTER TABULATION
U+0020 SPACE
U+00A0 NO-BREAK SPACE
U+1680 OGHAM SPACE MARK
U+180E MONGOLIAN VOWEL SEPARATOR
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE

(Ogham? Mongolian? Hmmm.)

Which doesn't include "vertical space" via \v:

U+000A LINE FEED (LF)
U+000B LINE TABULATION
U+000C FORM FEED (FF)
U+000D CARRIAGE RETURN (CR)
U+0085 NEXT LINE (NEL)
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
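
A quick way to check a few entries from the two lists (a sketch; the selection of characters is arbitrary and the hash name is made up for the example):

```perl
use strict;
use warnings;

# A handful of characters taken from the two lists above.
my %chars = (
    'CHARACTER TABULATION' => "\x{0009}",
    'NO-BREAK SPACE'       => "\x{00A0}",
    'IDEOGRAPHIC SPACE'    => "\x{3000}",
    'LINE FEED'            => "\x{000A}",
    'LINE SEPARATOR'       => "\x{2028}",
);

for my $name ( sort keys %chars ) {
    my $char = $chars{$name};
    my $h = $char =~ /\h/ ? 'yes' : 'no';
    my $v = $char =~ /\v/ ? 'yes' : 'no';
    printf "%-21s \\h=%-3s \\v=%s\n", $name, $h, $v;
}
```

Each character should land in exactly one of the two classes. \h and \v need Perl 5.10 or later.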

--tom

Mark Mielke

Oct 5, 2009, 12:41:09 AM
to Tom Christiansen, Eric Brine, Abigail, Tatsuhiko Miyagawa, karl williamson, jesse, demerphq, Perl5 Porters, yves....@booking.com
On 10/05/2009 12:32 AM, Tom Christiansen wrote:
>
>> I noticed you didn't touch \s, which is the one that troubles me (too?). I
>> often use \d and \w in patterns that are captured. It's good to match
>> tightly, so I agree with you. \s, on the other hand, matches parts of the
>> input I usually wish to discard. Having it behave laxly (i.e. match
>> characters such as NBSP) would benefit me.
>>
> As \s is currently [\h\v], perhaps you'd like "horizontal space" via \h:
>
> U+0009 CHARACTER TABULATION
> ...

Yes, that works - except it's not in 5.6.0 which we still use. :-(
Gah... why are large companies always stuck in the past?

Thanks for the suggestion.

Demerphq

Oct 5, 2009, 3:29:29 AM
to John, Perl5 Porters
2009/10/5 John <john....@vodafoneemail.co.uk>:

>
>> I would personally favour if \d becomes just a shortcut for [0-9], and \w
>> a
>> shortcut for [a-zA-Z_0-9]. All the time. Regardless of internal encoding
>> of the subject string, the locale, or how the regexp itself is encoded.
>>
>> Not until then will I stop recommending people to not use \d or \w.
>>
>>
>>
>> Abigail
>>
>
> Hi all,
> I would like to throw my two bits into the mix as a user, rather than a
> developer of Perl.
>
> I think if you want the unicode semantics and really want to match all
> unicode digits you should be forced to write \p{Digit}
>
> \d should, as Abigail states, match only [0-9]
>
> This should also hold for other shortcuts with pre-unicode definitions.
>
> This does not help out with locales, for which I propose a new set of regex
> properties \C{Property Name}.
>
> I picked 'C' as C tends to be the default locale
>
> So /\Cw/ or /\C{Word}/  would match a word character in the current locale.

Can you explain why you want the locale to influence things?

You see, the locale handling logic is basically a template of what we
don't want, in the regex engine, or in general. If I could get away
with it I would deprecate use locale, and all of the locale based
regops, as they are a major maintenance nightmare for IMO little benefit.
Once your text is stored as unicode you can define any properties you
wish. The regex engine/unicode infrastructure already has ways of
dealing with them and includes a comprehensive framework for dealing
with most everything we could want. What does locale give you? (Honest
question, I never use it, and have never seen a need to.)

Abigail

Oct 5, 2009, 4:27:38 AM
to Tom Christiansen, Eric Brine, Tatsuhiko Miyagawa, karl williamson, jesse, demerphq, Perl5 Porters, yves....@booking.com
On Sun, Oct 04, 2009 at 10:32:23PM -0600, Tom Christiansen wrote:
> >On Sun, Oct 4, 2009 at 5:20 PM, Abigail <abi...@booking.com> wrote:
>
> >> I would personally favour if \d becomes just a shortcut for [0-9], and \w a
> >> shortcut for [a-zA-Z_0-9]. All the time. Regardless of internal encoding
> >> of the subject string, the locale, or how the regexp itself is encoded.
>
> > I noticed you didn't touch \s, which is the one that troubles me (too?). I
> > often use \d and \w in patterns that are captured. It's good to match
> > tightly, so I agree with you. \s, on the other hand, matches parts of the
> > input I usually wish to discard. Having it behave laxly (i.e. match
> > characters such as NBSP) would benefit me.

\s troubles me less, because there isn't an equivalent issue to /1\d/ or
/\d+/. Furthermore, when one does want to separate "types" of whitespace,
one usually wants horizontal and vertical whitespace. For which we have
/\v/ and /\h/. I'd like to see /\s/ fixed in the sense that it matches
a fixed set of characters, regardless of encoding. I don't really care
whether it restricts itself to ASCII only, or whether "\x0b" is included
or not - after all, \h, \v and [\h\v] are already very useful.

>
> As \s is currently [\h\v]

It's not.

$ perl -E 'say "\x0b" =~ /\s/ ? "Yes" : "No"'
No
$ perl -E 'say "\x0b" =~ /\v/ ? "Yes" : "No"'
Yes

And while "\x85" always matches /\v/, it only matches /\s/ under
UTF-8 matching. Similar for "\xA0", which always matches /\h/, but
has a UTF-8 matching dependency on whether it matches /\s/.
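
The dependency can be sketched like this, assuming the code runs as a plain script with no `use locale`, no `use feature 'unicode_strings'`, and no `use v5.12`-style declaration (any of which may change the first \s result). The variable names are made up for the example:

```perl
use strict;
use warnings;

my $native   = "\xA0";        # NO-BREAK SPACE, stored as a single byte
my $upgraded = "\xA0";
utf8::upgrade($upgraded);     # same character, UTF-8 internal storage

printf "strings are equal:  %s\n", $native eq $upgraded ? 'yes' : 'no';
printf "native   =~ /\\s/:   %s\n", $native   =~ /\s/ ? 'yes' : 'no';
printf "upgraded =~ /\\s/:   %s\n", $upgraded =~ /\s/ ? 'yes' : 'no';
printf "native   =~ /\\h/:   %s\n", $native   =~ /\h/ ? 'yes' : 'no';
printf "upgraded =~ /\\h/:   %s\n", $upgraded =~ /\h/ ? 'yes' : 'no';
```

The two strings compare equal, and \h treats them alike; the interesting line is the first \s test, which under the assumptions above is expected to disagree with the second.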


See also "man perlrecharclass".


Abigail

Abigail

Oct 5, 2009, 6:12:17 AM
to perl5-...@perl.org


It seems you want to match horizontal whitespace. I use \h when I need
that, that will match space, tab, the no-break space, and a handful of
Unicode spaces.


Abigail

Karl Williamson

Oct 5, 2009, 1:59:39 PM
to perl5-...@perl.org

While not short names,
in 5.11 [[:blank:]] and \p{PosixBlank} should match just TAB and SPACE.
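
For instance, a sketch using \p{PosixBlank} (this assumes a perl new enough to know that property; the sample list is made up for the example):

```perl
use strict;
use warnings;

my @samples = (
    [ 'TAB',            "\t"   ],
    [ 'SPACE',          " "    ],
    [ 'NO-BREAK SPACE', "\xA0" ],
    [ 'LINE FEED',      "\n"   ],
);

my %is_posix_blank;
for my $sample (@samples) {
    my ( $name, $char ) = @$sample;
    $is_posix_blank{$name} = $char =~ /\A\p{PosixBlank}\z/ ? 1 : 0;
    printf "%-15s %s\n", $name,
        $is_posix_blank{$name} ? 'blank' : 'not blank';
}
```

Only the first two samples should be reported as blank.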

John

Oct 5, 2009, 2:31:41 PM
to perl5-...@perl.org

>
> For \s, it's a bit unusable at present, and changing the definition
> will create more confusion and breakage than gain.
>
> Changing \s, \w, and \d from their traditional meanings sounds dangerous.
>
> My opinion.
>
> What is the real gain here? That some applications will magically
> start supporting additional unicode sequences and "just work"? That
> people can type fewer regexp operands to get "new"-style behaviour?
> How many people want this?
>
> I suppose if it is "all non-English writers" my opinion might be
> out-numbered. :-)
>
> Cheers,
> mark
>

I'm in agreement with Mark. If you want unicode semantics for characters
that can be used in identifiers then you have \p{ID_Start} and
\p{ID_Continue}

\w should be its historical meaning

David Nicol

Oct 5, 2009, 2:29:13 PM
to Perl5 Porters
On Sun, Oct 4, 2009 at 11:32 PM, Tom Christiansen <tch...@perl.com> wrote:

>    U+180E MONGOLIAN VOWEL SEPARATOR


How about a comprehensive set of pragmata options for setting the
compile-time defaults, covering all of these combinations? That way
Tatsuhiko Miyagawa would not need to put a line at the top of every
Perl program he writes.

Let's map each pragma to a compile-time option so Tatsuhiko Miyagawa
can declare his preferences at install time.

John

Oct 5, 2009, 3:04:37 PM
to demerphq, Perl5 Porters
What I would really like to be able to do is something like this.

my $str = 'garçon';
use locale 'fr';
print "Contains French exemplar characters" if $str =~ /^\w+$/;
use locale 'en';
print "Contains non-English exemplar characters" unless $str =~ /^\w+$/;

And then make it an IO filter as well. Then I could have 1000.00 be
rendered as 1,000.00 or 1.000,00 depending on the locale.

Tom Christiansen

Oct 5, 2009, 3:03:42 PM
to John, perl5-...@perl.org
>\w should be its historical meaning

Careful: wouldn't historical meaning include
locales, wherein \w would also include (for example)
é and ç in French, ñ in Spanish, ß in German,
and ð and þ in Icelandic? And didn't we already
find that locale-shifting char classes made
life really hard on the regex engine (at least)?

I don't know whether this is harder on it than
it already suffers under the Unicode vs bytes
shifts in behavior, but both seem problematic
to an annoying degree.

This is why my test program was tricked into
thinking \s suddenly started matching VT like
\v does, despite decades of historical precedent.
I'd forced it into Unicode mode. :(

--tom
--
"Toss no fish to hysterical porpoises."

Demerphq

Oct 5, 2009, 3:21:51 PM10/5/09
to Tom Christiansen, John, perl5-...@perl.org
2009/10/5 Tom Christiansen <tch...@perl.com>:

>>\w should be its historical meaning
>
> Careful: wouldn't historical meaning include
> locales, wherein \w would also include (for example)
> é and ç in French, ñ in Spanish, ß in German,
> and ð and þ in Icelandic?  And didn't we already

> find that locale-shifting char classes made
> life really hard on the regex engine (at least)?

use locale is in some respects broken by qr//, as it doesn't use regex
flags and depends on the context it is compiled within.

So for instance, if you use locale and then have a sub return a qr//
compiled regex, and then use that object alone in a match anywhere that
you pass it, it will match using the semantics of the locale in effect
when it is matched. If the qr// is inserted in another pattern the
localeness of the pattern is destroyed.

In short qr// results compiled under use locale have different results
depending on how they are used. These regexes are also much slower
than ones not compiled under locale as they have to do a lot more run
time comparisons to check if they match.

> I don't know whether this is harder on it than
> it already suffers under the Unicode vs bytes
> shifts in behavior, but both seem problematic
> to an annoying degree.

Locale regexes are irritating because you can't precompute them. They
are defined to change based on your environment which can change in
between compilation and execution of the regex. So you delay a lot of
stuff that could be precomputed to inside of the regex matching loop.

> This is why my test program was tricked into
> thinking \s suddenly started matching VT like
> \v does, despite decades of historical precedent.
> I'd forced it into Unicode mode.  :(

And this is why we really really want \w and \s and \d to match the
traditional thing, even if this means requiring people add something
to older scripts to support the legacy behaviour. You can't tell what a
pattern does by looking at it, you have to know the internal bit flags
of the string involved.

Abigail

Oct 5, 2009, 3:37:22 PM10/5/09
to John, demerphq, Perl5 Porters
On Mon, Oct 05, 2009 at 08:04:37PM +0100, John wrote:
>>
> What I would really like to be able to do is something like this.
>
> my $string = 'garçon';

> use locale 'fr';
> print "Contains French exemplar characters" if $string=~/^\w+$/;
> use locale 'en';
> print "Contains non-English exemplar characters" unless $string=~/^\w+$/;

You can do it this way, not having to depend on whatever may be installed
on the system:

# User-defined property subs must have names beginning with "In" or "Is".
sub IsFrench {return <<"--"} # Might not be the correct set of French chars.
41 5A
61 7A
C0 C2
C6 CA
CC CE
D2 D4
E0 E2
E6 EA
EC EE
F2 F4
--

say "Contains French exemplar characters" if $string =~ /^\p{IsFrench}+$/;


Abigail

Jan Dubois

Oct 5, 2009, 3:37:11 PM10/5/09
to demerphq, Tom Christiansen, John, perl5-...@perl.org
On Mon, 05 Oct 2009, demerphq wrote:
> And this is why we really really want \w and \s and \d to match the
> traditional thing, even if this means requiring people add something
> to older scripts to support the legacy behaviour. You cant tell what a
> pattern does by looking at it, you have to know the internal bit flags
> of the string involved.

Just to be sure: \b will continue to be defined based on \w and \W
and change its behavior as well, right? I'm only asking because \b is
not explicitly listed in this discussion.

Cheers,
-Jan


John

Oct 5, 2009, 3:48:13 PM10/5/09
to Tom Christiansen, perl5-...@perl.org
Tom Christiansen wrote:
>> \w should be its historical meaning
>>
>
> Careful: wouldn't historical meaning include
> locales, wherein \w would also include (for example)
> é and ç in French, ñ in Spanish, ß in German,
> and ð and þ in Icelandic? And didn't we already
> find that locale-shifting char classes made
> life really hard on the regex engine (at least)?
>
>
That's why in an earlier email I proposed \Z{w} to handle locales so \w
stays fixed.

However if we could tweak 'locale' I'd like to be able to do

use locale 'fr';

and have \w match French accented characters as well.

Rafael Garcia-Suarez

Oct 5, 2009, 5:19:56 PM10/5/09
to Abigail, Perl5 Porters
2009/10/5 Abigail <abi...@abigail.be>:

> On Mon, Oct 05, 2009 at 08:04:37PM +0100, John wrote:
>>>
>> What I would really like to be able to do is something like this.
>>
>> my $string = 'garçon';

>> use locale 'fr';
>> print "Contains French exemplar characters" if $string=~/^\w+$/;
>> use locale 'en';
>> print "Contains non-English exemplar characters" unless $string=~/^\w+$/;
>
> You can do it this way, not having to depend on whatever may be installed
> on the system:
>
> sub IsFrench {return <<"--"}  # Might not be the correct set of French chars.

It's not; actually you even need codepoints > FF.
FWIW I agree with simplifying \d and \w everywhere. Using more complex
forms to match more complex sets is good Huffman coding, and is good
code documentation too.
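
To illustrate the point about code points above FF, here is a minimal sketch of a user-defined property that also covers the oe ligature and a few other letters. The name IsFrenchLetter and the exact repertoire are assumptions made for this example, not an authoritative list of French letters; the only hard requirements are that the sub name begins with "In" or "Is" and that each line holds one code point or a range in hexadecimal.

```perl
use strict;
use warnings;

# Hypothetical property name. Each line is a single code point or a
# tab-separated range, all in hex. The repertoire is illustrative only.
sub IsFrenchLetter {
    return <<"END";
41\t5A
61\t7A
C0
C2
C6\tCB
CE\tCF
D4
D9
DB\tDC
E0
E2
E6\tEB
EE\tEF
F4
F9
FB\tFC
FF
152\t153
178
END
}

# True when every character of the argument is in the set above.
sub looks_french {
    my ($text) = @_;
    return $text =~ /^\p{IsFrenchLetter}+$/ ? 1 : 0;
}

printf "%s\n", looks_french("gar\xE7on");     # c with cedilla, U+00E7
printf "%s\n", looks_french("c\x{153}ur");    # oe ligature, U+0153
printf "%s\n", looks_french("ma\xF1ana");     # n with tilde, not in the set
```

Listing the set by hand keeps the result independent of whatever locale data happens to be installed, at the cost of having to maintain the list.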

Tom Christiansen

Oct 5, 2009, 5:52:25 PM10/5/09
to Abigail, John, demerphq, Perl5 Porters
Abigail wrote:

> You can do it this way, not having to depend on whatever may be
> installed on the system:

[...]

Yours seems a much better approach. Being bound to the whims of one's
current system's idea of correct locales, let alone the current user's
setting of the same, is too unreliable. I've seen many errors in system
locales. For example, some of these symlinks senselessly point to ASCII:

darwin% ls -l /usr/share/locale/nl*/LC_C*
lrwxr-xr-x [...] 29 Nov 7 2008 /usr/share/locale/nl_BE.ISO8859-1/LC_COLLATE@ -> ../la_LN.ISO8859-1/LC_COLLATE
lrwxr-xr-x [...] 27 Nov 7 2008 /usr/share/locale/nl_BE.ISO8859-1/LC_CTYPE@ -> ../la_LN.ISO8859-1/LC_CTYPE
lrwxr-xr-x [...] 30 Nov 7 2008 /usr/share/locale/nl_BE.ISO8859-15/LC_COLLATE@ -> ../la_LN.ISO8859-15/LC_COLLATE
lrwxr-xr-x [...] 28 Nov 7 2008 /usr/share/locale/nl_BE.ISO8859-15/LC_CTYPE@ -> ../la_LN.ISO8859-15/LC_CTYPE
lrwxr-xr-x [...] 28 Nov 7 2008 /usr/share/locale/nl_BE.UTF-8/LC_COLLATE@ -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x [...] 17 Nov 7 2008 /usr/share/locale/nl_BE.UTF-8/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
lrwxr-xr-x [...] 28 Nov 7 2008 /usr/share/locale/nl_BE/LC_COLLATE@ -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x [...] 17 Nov 7 2008 /usr/share/locale/nl_BE/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
lrwxr-xr-x [...] 29 Nov 7 2008 /usr/share/locale/nl_NL.ISO8859-1/LC_COLLATE@ -> ../la_LN.ISO8859-1/LC_COLLATE
lrwxr-xr-x [...] 27 Nov 7 2008 /usr/share/locale/nl_NL.ISO8859-1/LC_CTYPE@ -> ../la_LN.ISO8859-1/LC_CTYPE
lrwxr-xr-x [...] 30 Nov 7 2008 /usr/share/locale/nl_NL.ISO8859-15/LC_COLLATE@ -> ../la_LN.ISO8859-15/LC_COLLATE
lrwxr-xr-x [...] 28 Nov 7 2008 /usr/share/locale/nl_NL.ISO8859-15/LC_CTYPE@ -> ../la_LN.ISO8859-15/LC_CTYPE
lrwxr-xr-x [...] 28 Nov 7 2008 /usr/share/locale/nl_NL.UTF-8/LC_COLLATE@ -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x [...] 17 Nov 7 2008 /usr/share/locale/nl_NL.UTF-8/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
lrwxr-xr-x [...] 28 Nov 7 2008 /usr/share/locale/nl_NL/LC_COLLATE@ -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x [...] 17 Nov 7 2008 /usr/share/locale/nl_NL/LC_CTYPE@ -> ../UTF-8/LC_CTYPE

darwin% ls -l /usr/share/locale/es*/LC_C*
-r--r--r-- [...] 2518 May 31 2008 /usr/share/locale/es_ES.ISO8859-1/LC_COLLATE
lrwxr-xr-x [...] 27 Nov 7 2008 /usr/share/locale/es_ES.ISO8859-1/LC_CTYPE@ -> ../la_LN.ISO8859-1/LC_CTYPE
-r--r--r-- [...] 2518 May 31 2008 /usr/share/locale/es_ES.ISO8859-15/LC_COLLATE
lrwxr-xr-x [...] 28 Nov 7 2008 /usr/share/locale/es_ES.ISO8859-15/LC_CTYPE@ -> ../la_LN.ISO8859-15/LC_CTYPE
lrwxr-xr-x [...] 28 Nov 7 2008 /usr/share/locale/es_ES.UTF-8/LC_COLLATE@ -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x [...] 17 Nov 7 2008 /usr/share/locale/es_ES.UTF-8/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
lrwxr-xr-x [...] 28 Nov 7 2008 /usr/share/locale/es_ES/LC_COLLATE@ -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x [...] 17 Nov 7 2008 /usr/share/locale/es_ES/LC_CTYPE@ -> ../UTF-8/LC_CTYPE

I believe that shows why you must not depend on system locales.
Ctype definitions apply to character classes, but these may not be
correct, as shown above. Unicode properties are more predictable,
and although they might not be what you want (eg, \p{IsDigit} vs
[0-9]), you know what they are. Right?

For collation, I advise ignoring locales altogether and using only
Unicode::Collate objects' sort method. You might wish to add arguments
for special treatment of say, ij in Dutch or ch in Spanish, but then
you know what you're getting, having spelt out your rules yourself.
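
A minimal sketch of that advice, assuming the Unicode::Collate module that ships with perl and its default collation table. The word list is made up for the illustration, and no language-specific tailoring is shown.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = ("zebra", "\x{E9}clair", "eclat");

# Plain sort compares code points, so U+00E9 lands after "z".
my @by_codepoint = sort @words;

# Unicode::Collate applies the Unicode Collation Algorithm with its
# default table; no POSIX locale is consulted.
my $collator = Unicode::Collate->new;
my @collated = $collator->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "code point order: @by_codepoint\n";
print "collated order:   @collated\n";
```

The collated list puts the accented word among the other words starting with "e", whereas the code point sort pushes it past "zebra".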

That's why I see explicitly listing characters you want to count as
being "in a language" the way you've done, Abigail, as more reliable.
Language rules lie outside of what Unicode addresses, and POSIX locales
cannot be counted on. Neither can user settings of the same.

--tom

PS: Contrary to widespread misunderstanding ASCII is insufficient even
for English, as it cannot be correctly expressed in ASCII alone.
Otherwise you're doomed to an absurd, telegraph-like reduction.

Tom Christiansen

Oct 5, 2009, 7:46:50 PM10/5/09
to John, perl5-...@perl.org
John <john....@vodafoneemail.co.uk> wrote:

>>> \w should be its historical meaning

>> Careful: wouldn't historical meaning include
>> locales, wherein \w would also include (for example)
>> é and ç in French, ñ in Spanish, ß in German,
>> and ð and þ in Icelandic? And didn't we already
>> find that locale-shifting char classes made
>> life really hard on the regex engine (at least)?

> That[']s why in an earlier email I proposed \Z{w} to
> handle local[e]s so \w stays fixed.

Or broken? :)

> However if we could twe[a]k 'locale' I'd like to be able to do

> use local[e] 'fr';

> and have \w match [F]rench accented characters as well.

But this is not at all so easy as you may think!

It's asking a lot for the Perl core, so much that it seems
better served by something in some Lingua::FR:: namespace.

[ELABORATION FOLLOWS IN SOME DETAIL]

For example, French orthography requires not only
the three accent marks:

* the acute accent, for é
* the grave accent, for è, à, ù
* the circumflex accent, for any of the five vowels,
as in fenêtre, forêt, côte, sûr

But accent marks aren't enough, nor may those be arbitrarily
applied. Beyond accent marks, proper French orthography also
demands several other mandatory diacritics and digraphs, such as:

* the cedilla to mark a soft c before [aou], as in
ça and français itself

* the di[a]eresis to mark a vowel in hiatus rather than in diphthong,
as in maïs, Haïti, Noël, capharnaüm; also occasionally after
g as in ambiguë, argüer, mangeüre; plus for a few imported words

* the digraph œ, mandatory in words like œuf, bœuf, chœur, œnologie

* the digraph æ, now needed (I believe) only for l'æschne in
current French, but perhaps also for Latin imports like
curriculum vitæ and et cætera

You'll also have to consider combining characters besides precomposed
ones. And you must be picky: just because you accept, for example,
the acute accent for é doesn't mean you accept á or ć; those are
illegal in native French words. They might each respectively occur in
Spanish or Polish, but so might many other things. Still, those might
well appear in an otherwise French text, so what do you do then?

Note this is current French only; historically other possibilities may
have occurred which should probably be considered.

And I've not even thought about what might be done with hyphenated words
or those with apostrophes. You can't leave them out; those are no more
multiple words than they'd be in English. Their hyphens and apostrophes
are required elements without being per se "letters". If you're going
to do \w "right", you really ought to think about such components, too.

I disavow any expertise in French, so matters may be otherwise
than as presented here. I doubt they're substantially simpler.

So you see, it's well beyond any mere matter of "accent marks".

While in actuality (French sense :) we might indeed have here
the knowledge needed for French (naming no names :), we surely
lack it for many other languages this would open the door for.

I don't mean there's nothing to what you said, John.

But I dread seeing such language-related rulesets in a core pragma within
the standard Perl distribution instead of off in some Lingua::*:: module on
CPAN, created and maintained by people with true expertise in each language.

--tom

Karl Williamson

Oct 6, 2009, 12:30:05 AM10/6/09
to Jan Dubois, demerphq, Tom Christiansen, John, perl5-...@perl.org
I had considered for a little while upgrading \b to the newer Unicode
Word_Break property, but decided against it. Thus unless someone else
were gung-ho to do that, \b would continue to be defined in terms of \w.
But your question prompted me to look at the code, and it appears to
me, Yves, that something would have to be done to address this. Thanks
for pointing it out.
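
The way \b tracks \w can be seen in a short sketch. It assumes perl 5.14 or later, where the /a and /u modifiers (which postdate this thread) select ASCII-only or Unicode rules explicitly; the sample word is made up for the illustration.

```perl
use strict;
use warnings;

my $word = "\xE9t\xE9";    # e-acute, t, e-acute

# Under ASCII rules the accented letters are \W, so there is a word
# boundary on each side of the "t".
my $ascii_boundary = ($word =~ /\bt\b/a) ? 1 : 0;

# Under Unicode rules all three characters are \w, so the only
# boundaries are at the very start and end of the string.
my $unicode_boundary = ($word =~ /\bt\b/u)     ? 1 : 0;
my $unicode_whole    = ($word =~ /^\b\w+\b$/u) ? 1 : 0;

print "ascii: $ascii_boundary, unicode: $unicode_boundary, ",
      "whole word: $unicode_whole\n";
```

Any change to what \w matches therefore moves the places where \b can match, which is why the two have to be handled together.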

Karl Williamson

Oct 6, 2009, 1:16:00 AM10/6/09
to demerphq, Tom Christiansen, John, perl5-...@perl.org
demerphq wrote:
> 2009/10/5 Tom Christiansen <tch...@perl.com>:
>>> \w should be its historical meaning
>> Careful: wouldn't historical meaning include
>> locales, wherein \w would also include (for example)
>> é and ç in French, ñ in Spanish, ß in German,
>> and ð and þ in Icelandic? And didn't we already

In reading these comments all at once, I'm not sure we are all on the
same page as to the proposal, and what happens now. So, let me state
what I think both are; correct me if I'm wrong:

The way it works now:

With a 'use locale' or on an EBCDIC platform:
they match whatever the C language ctype routines say they match:
isdigit() for \d, isspace() for \s, and isalnum() for \w (but I know \w
adds underscore but I didn't see where it was doing that in a quick scan
of the code).

Absent a 'use locale' and not on an EBCDIC platform:

    if (the string being matched against doesn't have the utf8 flag on
        && the regular expression doesn't contain something that would
           make it look like it should behave in utf8 semantics; any \p{}
           in it, for example, will force it into utf8)
    {
        \d = [0-9]; \w = [_a-zA-Z\d]; \s = [ \t\f\r\n]
    } else {
        they match what Unicode says, except that there are some bugs so
        that \w matches too much, like fractions.
    }

What I meant to say was the proposal:
No change to 'use locale' or EBCDIC. Even if we could deprecate 'use
locale', we would be stuck with supporting it in 5.12, I think.

Otherwise, \d = [0-9]; \w = [_a-zA-Z\d]; \s = [ \t\f\r\n]
regardless.
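
The dependence on the utf8 flag can be reproduced in a few lines. This is a minimal sketch, assuming perl 5.14 or later, where the legacy rules can be pinned with the /d modifier and ASCII-only or Unicode rules with /a and /u; those modifiers arrived after this thread. The helper name matches_word is made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report how \w treats a one-character string under
# the legacy rules (/d), ASCII-only rules (/a) and Unicode rules (/u).
sub matches_word {
    my ($str) = @_;
    return {
        legacy  => ($str =~ /^\w$/d ? 1 : 0),
        ascii   => ($str =~ /^\w$/a ? 1 : 0),
        unicode => ($str =~ /^\w$/u ? 1 : 0),
    };
}

my $native   = "\xE9";        # e-acute, stored as a single byte
my $upgraded = "\xE9";
utf8::upgrade($upgraded);     # same character, now stored as UTF-8 internally

my $same   = ($native eq $upgraded) ? 1 : 0;
my $before = matches_word($native);
my $after  = matches_word($upgraded);

print "strings compare equal: $same\n";
printf "native:   legacy=%d ascii=%d unicode=%d\n",
    @{$before}{qw(legacy ascii unicode)};
printf "upgraded: legacy=%d ascii=%d unicode=%d\n",
    @{$after}{qw(legacy ascii unicode)};
```

The two strings compare equal, yet only the legacy column changes between them; the /a and /u columns stay the same whichever way the string is stored.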

Demerphq

Oct 6, 2009, 3:29:28 AM10/6/09
to karl williamson, Jan Dubois, Tom Christiansen, John, perl5-...@perl.org
2009/10/6 karl williamson <pub...@khwilliamson.com>:

Yes I am aware of this. The BOUND regops handler (and relatives) need
to be fixed.

It will probably be addressed along with /\s/ and stuff.

cheers

Demerphq

Oct 6, 2009, 3:31:22 AM10/6/09
to karl williamson, Tom Christiansen, John, perl5-...@perl.org
2009/10/6 karl williamson <pub...@khwilliamson.com>:

> In reading these comments all at once, I'm not sure we are all on the same
> page as to the proposal, and what happens now. So, let me state what I
> think both are; correct me if I'm wrong:
>
> The way it works now:
>
> With a 'use locale' or on an EBCDIC platform:
> they match whatever the C language ctype routines say they match: isdigit()
> for \d, isspace() for \s, and isalnum() for \w (but I know \w adds
> underscore but I didn't see where it was doing that in a quick scan of the
> code).
>
> Absent a 'use locale' and not on an EBCDIC platform:
>
> If (the string being matched against doesn't have the utf8 flag on.
> && the regular expression doesn't contain something that would make it
> look like it should behave in utf8 semantics. Any \p{} in
> it, for example, will force it into utf8)
> {
> \d = [0-9]; \w = [_a-zA-Z\d]; \s = [ \t\f\r\n]
>
> } else {
>
> they match what Unicode says, except that there are some bugs so
> that \w matches too much, like fractions.
>
> }
>
>
>
>
>
> What I meant to say was the proposal:
> No change to 'use locale' or EBCDIC. Even if we could deprecate 'use
> locale', we would be stuck with supporting it in 5.12, I think.
>
> Otherwise, \d = [0-9]; \w = [_a-zA-Z\d]; \s = [ \t\f\r\n]
> regardless.

With the caveat that I am replying prior to finishing my first cup of
coffee of the day I think this looks right.

John

Oct 6, 2009, 1:35:39 PM10/6/09
to demerphq, karl williamson, Tom Christiansen, perl5-...@perl.org

>>
>>
>> What I meant to say was the proposal:
>> No change to 'use locale' or EBCDIC. Even if we could deprecate 'use
>> locale', we would be stuck with supporting it in 5.12, I think.
>>
>> Otherwise, \d = [0-9]; \w = [_a-zA-Z\d]; \s = [ \t\f\r\n]
>> regardless.
>>
>
> With the caveat that I am replying prior to finishing my first cup of
> coffee of the day I think this looks right.
>
> Yves
>
>
>
>
>
I'm in agreement as well :-)

Joshua ben Jore

Oct 6, 2009, 2:20:14 PM10/6/09
to demerphq, Tatsuhiko Miyagawa, karl williamson, jesse, Perl5 Porters, yves....@booking.com, Tom Christiansen
On Sun, Oct 4, 2009 at 8:32 AM, Joshua ben Jore <twi...@gmail.com> wrote:
> On Sun, Oct 4, 2009 at 1:00 AM, demerphq <deme...@gmail.com> wrote:
>> 2009/10/4 Tatsuhiko Miyagawa <miya...@gmail.com>:
>>> On Sat, Oct 3, 2009 at 9:34 PM, karl williamson <pub...@khwilliamson.com> wrote:
>>>
>>>> There has been plenty of discussion about this over the years.  The simple
>>>> explanation is that the current scheme is broken in various ways.  In the
>>>> things I'm working on fixing, the breakage is essentially that the internal
>>>> storage detail (utf8 or not) of strings changes their external semantics.
>>>
>>> Yes, that has been really annoying and I appreciate your efforts to fix that.
>>>
>>>> For the \d, etc things, there are a number of arguments for changing their
>>>> behavior.
>>>
>>> Yes :)
>>>
>>>> The one I can think of right now, is that it currently can be a
>>>> security threat, in that most perl programs out there are not expecting
>>>> Unicode at all, and so having, eg., \d  match not just 10 things but 411, or
>>>> \w match not just 63 things but 101,685 can lead to lots of unintended
>>>> consequences.  It seems better that the program has to explicitly indicate
>>>> that it is prepared to handle these expanded cases.
>>>
>>> I understand what your policy is about this, but I see expressions
>>> like "most perl programs" and "(The ASCII versions are) very commonly
>>> required" (earlier in this thread) that kind of upsets me, because
>>> that's not what I expect in most modern perl programs I write both
>>> personally and at work.
>>
>> I have to admit I worried about this a bit, but came to the conclusion
>> that likely
>>
>> a) more people get bitten by \d including things it shouldnt than
>> b) people like you who really want \d to mean \p{IsDigit}.
>>
>> I do regret that it might impact you personally, and hope that we can
>> get some drop in regex compilation filters in place so that you can do
>> something like:
>>
>> use re::UnicodeShortForms;
>>
>> and have \d mean \p{IsDigit}
>
> I've never seriously used this feature before but I did once pen a
> lexical \w replacement to mean [A-Za-z'.-] because I was matching lots
> of names. Below is Yves' suggestion for a community-solved problem.
> Ought to work back to 5.6 too. Of course, this could work the *other*
> way too. This turns \d => \p{IsDigit} but could also turn it into [0-9].
>
> package re::UnicodeShortForms;
> use 5.006;
> use overload;
> our %REPLACEMENTS = (
>    d => '\p{IsDigit}',
>    # w => ...
> );
> sub import { overload::constant( qr => \ &rewrite_regexp ) }
> sub rewrite_regexp {
>    my ( undef, $text ) = @_;
>    $text =~ s{
>        \\
>        ( d | . )
>    }{
>        $REPLACEMENTS{$1} || "\\$1"
>    }xge;
>    return $text;
> }
>
> 'Josh'

Just FYI, the above ought to work pretty darn well *except* that it
isn't skipping over the contents of (?#), (?{}), (??{}) blocks where
the characters \w might be otherwise meaningful. I'm sure something
like http://search.cpan.org/dist/Regexp-Parser would do the right thing.

Josh
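
For the "other way" mentioned above, here is a minimal single-file sketch that maps the short forms to explicit ASCII classes using the same overload::constant mechanism. The package name AsciiShortForms is hypothetical (it is not a CPAN module), and the same caveats apply: the rewrite is purely textual, so it does not understand (?#), (?{}) or (??{}) blocks, nor escapes that already sit inside a bracketed class.

```perl
package AsciiShortForms;    # hypothetical name, for illustration only
use strict;
use warnings;
use overload ();

# Called for each constant piece of a regex before it is compiled.
sub rewrite_regexp {
    my ($text) = @_;
    my %ascii = (
        d => '[0-9]',
        w => '[A-Za-z0-9_]',
        s => '[ \t\n\r\f]',
    );
    # Consume every backslash escape as a pair, so that "\\d" (a literal
    # backslash followed by "d") is left alone.
    $text =~ s{ \\ (.) }{ exists $ascii{$1} ? $ascii{$1} : "\\$1" }xgse;
    return $text;
}

sub import { overload::constant( qr => \&rewrite_regexp ) }

package main;

# Stand-in for "use AsciiShortForms;" when everything lives in one file.
BEGIN { AsciiShortForms->import }

# The \d below is rewritten to [0-9] when this sub is compiled.
sub is_digit_string { return $_[0] =~ /^\d+$/ ? 1 : 0 }

print is_digit_string("2009"), "\n";
print is_digit_string("\x{0662}\x{0660}\x{0660}\x{0669}"), "\n";    # Arabic-Indic digits
```

Because the handler is installed lexically, only patterns compiled in the scope that imported it are affected.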

Jesse

Oct 27, 2009, 9:31:41 PM10/27/09
to demerphq, jesse, karl williamson, Tatsuhiko Miyagawa, Perl5 Porters, yves....@booking.com, Tom Christiansen
Tonight on #perl6, Larry made a fairly definitive statement about \w
matching and unicode:

20:50 <@TimToady> it's terribly bad Huffman coding to restrict \w to ascii
20:50 <@TimToady> obra__: see ^^
20:50 <@TimToady> please kill that idea
20:51 <@TimToady> Perl 5 must not revert to ASCII semantics where we've been
gaining ground on Unicode for many years.

So. Now we know where we're aiming. How good are our tests? ;)

Abigail

Oct 27, 2009, 9:48:19 PM10/27/09
to jesse, Perl5 Porters


That's a pity.

But more importantly, did Larry express any sentiment regarding \d?


Abigail

Jesse

Oct 27, 2009, 10:06:38 PM10/27/09
to Abigail, jesse, Perl5 Porters

Now he has:

22:05 <obra_> TimToady: ping? Abigail asks if the same ruling applies to \d (and presumably also \s)
22:06 <@TimToady> yes, it does

> Abigail
>

--

Mark Mielke

Oct 27, 2009, 11:17:17 PM10/27/09
to jesse, Abigail, Perl5 Porters
On 10/27/2009 10:06 PM, jesse wrote:
> On Wed, Oct 28, 2009 at 02:48:19AM +0100, Abigail wrote:
>
>> That's a pity.
>> But more importantly, did Larry express any sentiment regarding \d
> Now he has:
>
> 22:05<obra_> TimToady: ping? Abigail asks if the same ruling applies to \d (and presumably also \s)
> 22:06<@TimToady> yes, it does
>

Probably sacrilegious - but I think this is a poor decision, and don't
see how it is Rule 1 material. Do we see Larry around here any more? I
do not see posts, and Rule 1 via proxy with cut + paste from IRC is like
dumping a girl friend via a text message. If Larry really cared, he
should post something eloquent on this mailing list - not a one line
summary that says "we've come this far already - no turning back now."

Then again - my usage of Perl has been on a steady decline, and I've
seen the same trend elsewhere. By the time Perl perfectly supports
Unicode, after so many imperfection evolutions, I predict Perl 5 will
not be relevant nor a target programming language for applications that
require full Unicode support. There are so many alternatives now-a-days,
that do at least as good a job, that it is inevitable for this to occur.
Perl 5's main staying power in my opinion was its portability - but
major changes in behaviour across releases seriously degrades
portability, and the widespread use of XS in CPAN modules also works
against portability. Choosing to use Perl as the best-in-choice language
for an application in 2009 is becoming a lot more difficult.

Feel free to begin throwing rotting fruit at me and/or banning me from
the list.

Yuval Kogman

Oct 28, 2009, 3:25:30 AM10/28/09
to Perl5 Porters
How difficult would it be to introduce special chars which aren't
charclasses, which are probably more suitable for what people want anyway
(things that agree with grok_number, with rules for natural numbers,
integers, decimal fractions, and floating point notation)?

Seems like the distinction between matching a character that is a digit vs.
matching ascii digits is mostly about what you do with the numbers
afterwards. Perhaps it's better to just remove the extra duplication?

Demerphq

unread,
Oct 28, 2009, 4:52:42 AM10/28/09
to jesse, karl williamson, Tatsuhiko Miyagawa, Perl5 Porters, yves....@booking.com, Tom Christiansen
2009/10/28 jesse <je...@fsck.com>:

I am *extremely* upset at this.

Not only was your question not what we propose to do, but the fact
that you didn't discuss it with me, or have this discussion with me,
makes me extremely unhappy.

We have bugs. Bugs can't be resolved by appeal to Larry.

And since Larry isn't doing the work, and likely *I* will be, then I
reckon I should have been included in the discussion.

I'm really tempted to say find yourself another regex engine hacker.

cheers,

demerphq

Oct 28, 2009, 5:01:35 AM10/28/09
to jesse, karl williamson, Tatsuhiko Miyagawa, Perl5 Porters, yves....@booking.com, Tom Christiansen
2009/10/28 demerphq <deme...@gmail.com>:

And actually, I simply fundamentally reject this as a rule 1 ruling.

If you want to get a rule 1 decision on this that I will accept, plan
to have me in the discussion.

Mark Mielke

Oct 28, 2009, 6:10:44 AM10/28/09
to demerphq, jesse, karl williamson, Tatsuhiko Miyagawa, Perl5 Porters, yves....@booking.com, Tom Christiansen
On 10/28/2009 05:01 AM, demerphq wrote:
> 2009/10/28 demerphq<deme...@gmail.com>:

>
>>
>> We have bugs. Bugs cant be resolved by appeal to Larry.
>>
>> And since Larry isnt doing the work, and likely *I* will be then I
>> reckon i should have been included in the discussion.
>>
>> Im really tempted to say find yourself another regex engine hacker.
>>
> And actually, I simply fundamentally reject this as a rule 1 ruling.
>
> If you want to get a rule 1 decision on this that I will accept plan
> to have me in the discussion.

Communities should function on merit, not on laurels. In terms of merit,
I vote to accept a decree from the most active and influential members
of the community with a history of recent successes - not from a past
active and influential member with past successes. I consider Yves to be
high on the list deserving of title based upon merit.

Free software is not "owned" by any single party. If Perl is free, it
cannot have a single person or company having sole veto power of its
future. Rule 1 and 2 need to be updated or removed entirely. In Canada,
we still have "on the books" that the Queen of England can veto any
decisions our government makes - but if the Queen ever used this
authority, disrespecting the choice of our nation, we would quickly
assert our independence and remove this power. The situation here is
similar - I cannot remember the last time Larry Wall posted to this
newsgroup or submitted a patch.

This community needs an injection of nurture and good will. If Larry is
reading this - I mean nothing personal. Perl was great for me for many
years, and I am glad you wrote it instead of whining about *sh/awk as
others did. But that was in the early '90s and before. It is now over
20 years later. The torch for Perl 5 has passed on to other people, as
it should. I think some respect for these new torch bearers is deserved.
The laurels should be passed on to those who maintain an active and
influential reputation based on merit. The veto power should be banished
or updated to reflect the current state of the community.

I don't accept that any person can be right even when they are wrong. I
reserve this sort of faith for my God and none other. Perl is a
programming language. It is a tool to fulfill a purpose. It is not divine.

Nicholas Clark

Oct 28, 2009, 6:14:23 AM10/28/09
to Mark Mielke, demerphq, jesse, karl williamson, Tatsuhiko Miyagawa, Perl5 Porters, yves....@booking.com, Tom Christiansen
On Wed, Oct 28, 2009 at 06:10:44AM -0400, Mark Mielke wrote:

> Free software is not "owned" by any single party. If Perl is free, it
> cannot have a single person or company having sole veto power of its
> future. Rule 1 and 2 need to be updated or removed entirely. In Canada,

That's fine with me.

You're as free as anyone else to fork http://perl5.git.perl.org/perl.git
within the terms of the licence, and go forth and build up a userbase.

Nicholas Clark

Demerphq

Oct 28, 2009, 6:55:10 AM10/28/09
to Mark Mielke, jesse, karl williamson, Tatsuhiko Miyagawa, Perl5 Porters, yves....@booking.com, Tom Christiansen
2009/10/28 Mark Mielke <ma...@mark.mielke.cc>:

> On 10/28/2009 05:01 AM, demerphq wrote:
>>
>> 2009/10/28 demerphq<deme...@gmail.com>:
>>
>>>
>>> We have bugs. Bugs cant be resolved by appeal to Larry.
>>>
>>> And since Larry isnt doing the work, and likely *I* will be then I
>>> reckon i should have been included in the discussion.
>>>
>>> Im really tempted to say find yourself another regex engine hacker.
>>>
>>
>> And actually, I simply fundamentally reject this as a rule 1 ruling.
>>
>> If you want to get a rule 1 decision on this that I will accept plan
>> to have me in the discussion.
>
> Communities should function on merit, not on laurels. In terms of merit, I
> vote to accept a decree from the most active and influential members of the
> community with a history of recent successes - not from a past active and
> influential member with past successes. I consider Yves to be high on the
> list deserving of title based upon merit.
>
> Free software is not "owned" by any single party. If Perl is free, it cannot
> have a single person or company having sole veto power of its future. Rule 1
> and 2 need to be updated or removed entirely. In Canada, we still have "on
> the books" that the Queen of England can veto any decisions our government
> makes - but if the Queen ever used this authority, disrespecting the choice
> of our nation, we would quickly assert our independence and remove this
> power. The situation here is similar - I cannot remember the last time Larry
> Wall posted to this newsgroup or submitted a patch.

Hi, while I am sympathetic to much of what you wrote here I do have a
different core opinion.

Every healthy community ends up with checks and balances to keep
things flowing; in particular there is usually someone who serves the
role of chief justice. I view Larry's role as pretty much chief
justice. As such I do accept his right to make final decisions
regardless of his patch/post rate over the past while. Perl is his
and I absolutely do accept his right to make a final ruling.

What I do not accept, and I believe I have at least some support for
this in the community, is that a Rule 1 hearing has even happened.

If there is going to be a Rule 1 hearing then IMO the question needs
to be clear and the relevant parties able to present the facts. This
hasn't happened, ergo there is no ruling.

Additionally my view is that even if I were to concede that such a
hearing has happened, and I do not, the question and answer are in my
opinion not relevant to the plans we have for \w \s and \d, as we don't
plan to do what the question asked, and thus the answer is irrelevant.
It does imply that at least one facet, that of default behaviour,
is resolved. I view this as non-controversial, as a careful
reading of my most recent posts says pretty much the same thing, with
the possible exception of \d, which is a subject I still consider open
on security grounds alone.

Mark Mielke

Oct 28, 2009, 7:06:36 AM10/28/09
to demerphq, jesse, karl williamson, Tatsuhiko Miyagawa, Perl5 Porters, yves....@booking.com, Tom Christiansen

For the right to be recognized means for the community policies to
acknowledge both the right, and the cost of a fork.

Above, you acknowledge the right, but not the cost, nor did you respond
to the question.

In terms of cost - I've seen the quick response of "go ahead and make
your own fork, and build up your own userbase" given in numerous forms
to numerous people with concerns - but there is a problem with this
statement. The word "your" suggests an emphasis on the investment
required to set up a fork in an effort to discourage or discredit the
concern. "Unless *YOU* are willing to invest in the effort to make a
successful fork, *YOUR* opinion is not worth considering." A community
should always be "our" - not "my" or "your". I think any argument that
relies on the inability of the challenger to personally invest in a
complete competitive replacement for the community is weak. So weak, in
fact, that forks are pretty common, especially for free / open source
projects. Many community leaders have seen their entire community
leave as a direct result of their failure to collaborate and compromise,
resting on the assumption that a fork is too much effort to occur.

In terms of the question:

Rule 1: Larry is always by definition right about how Perl should
behave. This means he has final veto power on the core functionality.

Yves: I simply fundamentally reject this as a rule 1 ruling.

Does the community vote to adhere to Rule 1 and enforce the ruling,
including putting in the effort to make the changes as required by Rule
1? Or does the community vote to waive Rule 1 in this case for Yves,
leading to a precedent of treating it as optional policy in the future?

Either Rule 1 will be enforced or it won't be. I think it shouldn't be,
and I think it won't be. I think the right to reject rule 1 has been put
on the table, and that the community should support Yves. Note that
right or wrong is not a factor here - the real complaint from Yves is:

> And since Larry isnt doing the work, and likely *I* will be then I
> reckon i should have been included in the discussion.

This is a legitimate complaint. The right to be included in the
discussion, though, rejects rule 1. Rule 1 is pretty arrogant.

What do you think?

You may not have meant anything by what you responded - perhaps you even
meant to be benevolent. :-) You've walked into my rant, and if you find
my response offensive or outrageous, I apologize in advance. :-) It is
meant to be thoughtful and reflective.

Nicholas Clark

unread,
Oct 28, 2009, 7:16:44 AM10/28/09
to Mark Mielke, demerphq, jesse, karl williamson, Tatsuhiko Miyagawa, Perl5 Porters, yves....@booking.com, Tom Christiansen
On Wed, Oct 28, 2009 at 07:06:36AM -0400, Mark Mielke wrote:

> Does the community vote to adhere to Rule 1 and enforce the ruling,

> What do you think?

The community can vote all it frigging likes.

What matters is who contributes code, and whose thoughts influence those who
contribute code.

Anyone is free to have an opinion. Don't get me wrong on that.

But the conversion of opinions to actions determines what happens.


I've seen someone *complain* that perl5-porters is a meritocracy, because
that means that they're ignored. (Forget whom, and it was somewhere on IRC)

(Ignore for the moment whether it is or isn't actually functioning as a
meritocracy).

Which struck me as naive on the part of the complainer, because implicit
in the complaint was a rejection of "he who pays the piper calls the tune"


The community can do what the hell it likes. But if it doesn't cause people to

1: answer bug reports about the perl core code
2: locate the causes of bugs in the perl code
3: fix those bugs
4: contribute improvements to the perl core code

then the community, or those parts of it uninvolved in the above is
irrelevant here.

"cause" can be contribute time, contribute code, contribute funding to any
entity capable of converting money into the previous two.

Nicholas Clark

Mark Mielke

unread,
Oct 28, 2009, 7:17:23 AM10/28/09
to demerphq, jesse, karl williamson, Tatsuhiko Miyagawa, Perl5 Porters, yves....@booking.com, Tom Christiansen
On 10/28/2009 06:55 AM, demerphq wrote:
> Every healthy community ends up with checks and balances to keep the
> things flowing, in particular there is usual someone that serves the
> role of chief justice. I view Larrys roles as pretty much chief
> justice. As such I do accept his right to make final decisions
> regardless as to his patch/post rate over the past while. Perl is his
> and I absolutely do accept his right to make a final ruling.
>
> What I do not accept, and I believe I have at least some support for
> this in the community, is that a Rule 1 hearing has even happened.
>
> If there is going to be a Rule 1 hearing then IMO the question needs
> to be clear and the relevant parties able to present the facts. This
> hasn't happened ergo there is no ruling.
>

Haha. I think you believe in rule 2, then. :-)

Yes - we are different on this. I think chief justice as you call it
should be an elected term position, and should require certain
responsibilities which include a minimum level of activity in the
community. I don't believe in benevolent dictators - and especially not
absent or casual benevolent dictators.

I think the community should have the power to shape its own rules, and
the community should evaluate these rules periodically to ensure they
are providing the most value.

Anyways - unless other people feel the same, my opinions might be that
of a secluded community of one. In any case - I don't care what happens,
but I do want to ensure that you are respected, and that you are
influential in any discussion related to the substantial work you have
contributed, which specifically includes regex and unicode. You deserve
a hearing - and if it came right down to it, I would accept your call
over an absent benevolent dictator, even if it disagreed with my own
position. Thank you for your contributions.

I'll leave your mailboxes clear for a while again without my drivel.
Have a good week.

Jesse

unread,
Oct 28, 2009, 8:24:08 AM10/28/09
to demerphq, Perl5 Porters


On Wed, Oct 28, 2009 at 09:52:42AM +0100, demerphq wrote:
> 2009/10/28 jesse <je...@fsck.com>:
> > Tonight on #perl6, Larry made a fairly definitive statement about \w
> > matching and unicode:
> >
> > 20:50 <@TimToady> it's terribly bad Huffman coding to restrict \w to ascii
> > 20:50 <@TimToady> obra__: see ^^
> > 20:50 <@TimToady> please kill that idea
> > 20:51 <@TimToady> Perl 5 must not revert to ASCII semantics where we've been
> > gaining ground on Unicode for many years.
> >
> > So. Now we know where we're aiming. How good are our tests? ;)
> >
>
> I am *extremely* upset at this.
>
> Not only was your question not what we propose to do, but the fact
> that you didnt discuss it with me, or have this discussion with me
> makes me extremely unhappy.

Just for the record, I did not ask Larry a question about the regex
semantics or ask Larry to make a ruling. Larry made a statement on #perl6
and asked me to convey it to perl5-porters. I'd thought that any
changes to the regex engine in this area were currently on-hold as
"insanely complicated and need a bunch of work to get sorted out."

There _were_ a number of conflicting ideas about what \w and friends
should match by default. Some smarter and some crazier. In general, I had
understood us to be in good (no worse than 5.8) shape with regard to \w,
\d, and \s at the moment.

Larry doesn't say "must not" about Perl 5 often. When he does, it's
certainly newsworthy.

I promise that I wasn't trying to piss you off. Even to get back at
you for a certain two and a half hour discussion on history editing the
other day ;)

Best,
Jesse

Dr.Ruud

unread,
Oct 28, 2009, 4:52:23 AM10/28/09
to perl5-...@perl.org

How about some "use ascii;" / "use re 'ascii';"?

--
Ruud

Demerphq

unread,
Oct 28, 2009, 9:36:16 AM10/28/09
to jesse, Perl5 Porters
2009/10/28 jesse <je...@fsck.com>:

The original plan WAS overreaching and a bad idea and mea-culpa.

The refined plan was to make it configurable, with the exception of
\d, which I and many believe should default to ascii semantics as
there are very few applications where \d matching anything else is the
right thing to do.

If Larry really believes that \d matching Thai digits, superscripts,
subscripts, and other bizarre things is the right Huffman encoding for
\d, then I want him to say it on list himself so that when i close
tickets related to the subject I can point at his email. In particular
right now \d matches the following codepoint ranges:

0030 0039
0660 0669
06F0 06F9
07C0 07C9
0966 096F
09E6 09EF
0A66 0A6F
0AE6 0AEF
0B66 0B6F
0BE6 0BEF
0C66 0C6F
0CE6 0CEF
0D66 0D6F
0E50 0E59
0ED0 0ED9
0F20 0F29
1040 1049
1090 1099
17E0 17E9
1810 1819
1946 194F
19D0 19D9
1B50 1B59
1BB0 1BB9
1C40 1C49
1C50 1C59
A620 A629
A8D0 A8D9
A900 A909
AA50 AA59
FF10 FF19
104A0 104A9
1D7CE 1D7FF

This list has changed a couple of times as I recall, and no doubt will
again in some future version of unicode. So is that really right? For
\w the case is arguable either way so I dont object to making it match
unicode, but for \d? Do we really want to force every person matching
url parameters for an id to use [0-9] instead? This has time and again
been remarked upon as a bad call and a bug in waiting, to the extent
that many people avoid \d as being altogether too risky.

Your quote of a less than 60 second response to this question is not
sufficient in my book. And I think I've done enough work on Perl to
deserve such a mail.

> I promise that I wasn't trying to piss you off. Even to get back at
> you for a certain two and a half hour discussion on history editing the
> other day ;)

I understand. However I do think that if Larry wants to invoke Rule 1
he has to do it on list personally and address the concerns involved.
I don't really think that is an unreasonable expectation, at least in
this case.
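
To make the URL-parameter concern above concrete, here is a minimal sketch. The two validator names are hypothetical and exist only for illustration; the sketch assumes default regex rules (no locale, no charset modifier) and a string containing code points above 255.

```perl
use strict;
use warnings;

# Hypothetical validators for a numeric "id" URL parameter.
sub looks_like_id_loose  { my ($s) = @_; return $s =~ /\A\d+\z/     ? 1 : 0 }
sub looks_like_id_strict { my ($s) = @_; return $s =~ /\A[0-9]+\z/ ? 1 : 0 }

my $ascii = "42";
my $thai  = "\x{0E54}\x{0E52}";   # THAI DIGIT FOUR, THAI DIGIT TWO

# Print only ASCII so no output layer is needed.
printf "%-5s loose=%d strict=%d\n", 'ascii',
    looks_like_id_loose($ascii), looks_like_id_strict($ascii);
printf "%-5s loose=%d strict=%d\n", 'thai',
    looks_like_id_loose($thai),  looks_like_id_strict($thai);
```

The Thai string passes the \d-based check but not the [0-9] one, which is the gap people run into when they assume \d means ASCII digits.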

Ricardo Signes

unread,
Oct 28, 2009, 9:45:51 AM10/28/09
to perl5-...@perl.org
* demerphq <deme...@gmail.com> [2009-10-28T09:36:16]

> The refined plan was to make it configurable, with the exception of
> \d, which I and many believe should default to ascii semantics as
> there are very few applications where \d matching anything else is the
> right thing to do.

As for most of the potential changes to \w and \s, I have not much opinion. In
all my code that expects Unicode, I have been careful, and I hope others have,
too.

As for \d, though, I am horrified to think how much bad behavior could be
introduced if \d started to match TITLE CASE KLINGON NUMERAL CHORGH

I think it is likely that I would not upgrade to a perl5 that introduced such
behavior. "Review every regex that uses \d" is not an acceptable burden.

--
rjbs

Demerphq

unread,
Oct 28, 2009, 9:51:45 AM10/28/09
to Ricardo Signes, perl5-...@perl.org
2009/10/28 Ricardo Signes <perl...@rjbs.manxome.org>:

What you just described is the present situation. And many people have
this bug and have done exactly what you said.

If unicode adds that codepoint, and gives it the property IsDigit then
it will start to match in some version of Perl in at least some
situations. The question is which situations those should be.

Ricardo Signes

unread,
Oct 28, 2009, 10:09:50 AM10/28/09
to perl5-...@perl.org
* demerphq <deme...@gmail.com> [2009-10-28T09:51:45]

> >
> > As for \d, though, I am horrified to think how much bad behavior could be
> > introduced if \d started to match TITLE CASE KLINGON NUMERAL CHORGH
> >
> > I think it is likely that I would not upgrade to a perl5 that introduced
> > such behavior. "Review every regex that uses \d" is not an acceptable
> > burden.
>
> What you just described is the present situation. And many people have
> this bug and have done exactly what you said.

I stand both corrected and astonished.

--
rjbs

Abigail

unread,
Oct 28, 2009, 11:08:09 AM10/28/09
to demerphq, jesse, Perl5 Porters
On Wed, Oct 28, 2009 at 02:36:16PM +0100, demerphq wrote:
>
> This list has changed a couple of times as I recall, and no doubt will
> again in some future version of unicode. So is that really right? For
> \w the case is arguable either way so I dont object to making it match
> unicode, but for \d? Do we really want to force every person matching
> url parameters for an id to use [0-9] instead? This has time and again
> been remarked upon as a bad call and a bug in waiting, to the extent
> that many people avoid \d as being altogether too risky.


For \w to have Unicode semantics, and \d not, is IMO, worse than it's
now. By all means, whatever you do, make \w, \d, and \s either all match
ASCII only, or let \w, \d, \s be short hands for \p{IsWord}, \p{IsDigit}
and \p{IsSpacePerl}. But don't mix and match.

Abigail

Karl Williamson

unread,
Oct 28, 2009, 11:27:08 AM10/28/09
to Ricardo Signes, perl5-...@perl.org

Just for the record, Unicode has resisted so far the attempts to
formalize Klingon. But, it is available in an unofficial Private Use
area, partitioned and registered at http://www.evertype.com/standards/csur/
U+F8F0 KLINGON DIGIT ZERO
U+F8F1 KLINGON DIGIT ONE
U+F8F2 KLINGON DIGIT TWO
U+F8F3 KLINGON DIGIT THREE
U+F8F4 KLINGON DIGIT FOUR
U+F8F5 KLINGON DIGIT FIVE
U+F8F6 KLINGON DIGIT SIX
U+F8F7 KLINGON DIGIT SEVEN
U+F8F8 KLINGON DIGIT EIGHT
U+F8F9 KLINGON DIGIT NINE

Coincidentally, a linguist friend of mine told me this week that Klingon
is the 2nd most widely spoken made-up language in the world, after
Esperanto. So, Unicode may encode them. They already did encode GB
Shaw's alphabet, a Mormon alphabet, and there are serious proposals to
encode JRR Tolkien's alphabets.

David Nicol

unread,
Oct 28, 2009, 11:31:19 AM10/28/09
to perl5-...@perl.org
On Wed, Oct 28, 2009 at 8:45 AM, Ricardo Signes

> As for \d, though, I am
> horrified to think how much bad behavior could be
> introduced if \d started to match TITLE CASE KLINGON NUMERAL CHORGH

So is anyone working on making perl's atof function handle all these
additional code points? What's the unicode for NaN anyway?


--
warlorded myself

Karl Williamson

unread,
Oct 28, 2009, 12:11:52 PM10/28/09
to David Nicol, perl5-...@perl.org
I can't answer about atof, but all code points that aren't numeric in
some way have NaN as the value of the numeric value property for them,
which Perl doesn't currently expose. All these also have the numeric
type of None, which Perl also doesn't currently expose. Perl does
expose the other numeric types, so you could use a complement of the
and's of these, each of which written like \p{nt:decimal}.

I'm working to expose these other properties.
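
As a rough illustration of the numeric-type properties mentioned above, the following sketch classifies single characters. It assumes a Perl recent enough to accept the long-form \p{Numeric_Type=...} spelling; the helper name numeric_type is hypothetical, and "None" is returned by fall-through rather than by matching a property.

```perl
use strict;
use warnings;

# Hypothetical helper: report the Unicode Numeric_Type of one character.
sub numeric_type {
    my ($char) = @_;
    return 'Decimal' if $char =~ /\A\p{Numeric_Type=Decimal}\z/;
    return 'Digit'   if $char =~ /\A\p{Numeric_Type=Digit}\z/;
    return 'Numeric' if $char =~ /\A\p{Numeric_Type=Numeric}\z/;
    return 'None';   # everything that is not numeric in some way
}

my @samples = (
    [ 'DIGIT SEVEN',          "7"        ],
    [ 'THAI DIGIT SEVEN',     "\x{0E57}" ],
    [ 'SUPERSCRIPT TWO',      "\x{00B2}" ],
    [ 'VULGAR FRACTION HALF', "\x{00BD}" ],
    [ 'LATIN SMALL LETTER X', "x"        ],
);

printf "%-22s %s\n", $_->[0], numeric_type($_->[1]) for @samples;
```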

David Nicol

unread,
Oct 28, 2009, 1:21:30 PM10/28/09
to karl williamson, perl5-...@perl.org
On Wed, Oct 28, 2009 at 11:11 AM, karl williamson

> I'm working to expose these other properties.

will
say 0+'𒑢'
give 0.25?

That would be cool.

Karl Williamson

unread,
Oct 28, 2009, 1:31:36 PM10/28/09
to jesse, Abigail, Perl5 Porters, demerphq

I agree with Yves. This kind of decision needs a better airing than we
have received. It's not clear from this excerpt if Larry is aware of
the discussions and issues that have gone on in this forum. For
example, is he aware that this was to be configurable?


Karl Williamson

unread,
Oct 28, 2009, 1:33:24 PM10/28/09
to David Nicol, perl5-...@perl.org
Not without more work, but I'm setting things up so that this could be done.

Jesse

unread,
Oct 28, 2009, 1:44:13 PM10/28/09
to karl williamson, jesse, Abigail, Perl5 Porters, demerphq
> I agree with Yves. This kind of decision needs a better airing than we
> have received. It's not clear from this excerpt if Larry is aware of
> the discussions and issues that have gone on in this forum. For
> example, is he aware that this was to be configurable?

Yves and I have discussed this in some depth. My understanding is that
Larry was speaking of the default case. As I understand it, the current
point of contention is actually around default matching of \d. Yves is
writing up what he believes is "the right thing." I'll do my best to
make sure that if there are concerns with Yves' plan that they get
discussed by the relevant parties.

--

Demerphq

unread,
Oct 28, 2009, 1:52:42 PM10/28/09
to jesse, karl williamson, Abigail, Perl5 Porters
2009/10/28 jesse <je...@fsck.com>:

Yes I agree. I think it is best to let this thread die.

I will write up a new summary of what I believe to be a sane plan, and
then we can see if there really is anything to argue about.

As you say here, I believe that the only area of controversy *at this
point* is about \d, and I think that we can sort it out as a community
without resorting to rule 1 intervention. :-)

I carry a great deal of responsibility for what controversy there is,
as I vastly underestimated the impact that changing the default
behaviour *at all* would cause, and it *is* reasonable that people
thought that original plan was a bad thing. I am sorry for any
community heartache that has caused.

John

unread,
Oct 28, 2009, 3:42:22 PM10/28/09
to jesse, demerphq, karl williamson, Tatsuhiko Miyagawa, Perl5 Porters, yves....@booking.com, Tom Christiansen
jesse wrote:
> Tonight on #perl6, Larry made a fairly definitive statement about \w
> matching and unicode:
>
> 20:50 <@TimToady> it's terribly bad Huffman coding to restrict \w to ascii
> 20:50 <@TimToady> obra__: see ^^
> 20:50 <@TimToady> please kill that idea
> 20:51 <@TimToady> Perl 5 must not revert to ASCII semantics where we've been
> gaining ground on Unicode for many years.
>
I'm sorry but this is just wrong.
At best this is Rule 2 and anyway we have been debating not removing do
(LIST) because removing it breaks perl 4 scripts. What do you think extending \w
\d and \s will do?

Furthermore, taint washing is carried out by regexes and extending the
semantics of \w \d and \s could allow tainted data to be cleaned where
it should not.

If you think there are a lot of scripts out there that use do (LIST),
that will pale into insignificance next to those scripts that assume \w ≡
[a-zA-Z0-9_]

Oh and from http://perldoc.perl.org/perlre.html

\w Match a "word" character (alphanumeric plus "_")

Now I don't see alphanumeric defined anywhere but I also don't see how
it can be forced to match 灞

John


Zefram

unread,
Oct 28, 2009, 4:41:20 PM10/28/09
to Perl5 Porters
John wrote:
>What do you think extending \w \d and \s will do.

Wrong tense.

The status quo, since 5.8 (with complicated status in 5.6) is that \w and
\s are unusably broken, but attempt to match Unicode character classes.
\d also attempts to match a Unicode character class, and afaik does
so successfully.

-zefram

Eric Brine

unread,
Oct 28, 2009, 4:40:09 PM10/28/09
to John, jesse, demerphq, karl williamson, Tatsuhiko Miyagawa, Perl5 Porters, yves....@booking.com, Tom Christiansen
On Wed, Oct 28, 2009 at 3:42 PM, John <john....@vodafoneemail.co.uk>wrote:

> Now I don't see alphanumeric defined anywhere but I also don't see how it
> can be forced to match 灞

It already does

$ perl -v
This is perl, v5.8.8 built for i486-linux-gnu-thread-multi
...

$ perl -le'print chr(28766) =~ /^\w\z/ || 0'
1

> Furthermore, taint washing is carried out by regexes and extending the
> semantics of \w \d and \s could allow tainted data to be cleaned where it
> should not.

If you're using \w to filter out Chinese characters, you're already failing.

> What do you think extending \w \d and \s will do.

There's been no discussion of expanding them. The problem is that what they
match varies depending on Perl internals:

$ perl -le'
   $s1 = "\xC2";
   $s2 = "\x{2660}";
   for ($s1, $s2, $s1.$s2) {
      print /\w/ || 0;
   }
'
0
0
1

If there's no \w in s1 or in s2, why does their concatenation have one?

Abigail

unread,
Oct 28, 2009, 5:07:40 PM10/28/09
to Eric Brine, John, jesse, demerphq, karl williamson, Tatsuhiko Miyagawa, Perl5 Porters, yves....@booking.com, Tom Christiansen
On Wed, Oct 28, 2009 at 04:40:09PM -0400, Eric Brine wrote:
>
> If you're using \w to filter out chinese characters, you're already failing.
>

The fundamental question is, "what do you use \w for"? And is this usage
correct?

I seldom use \w unless I know the text I match against is ASCII only.
And when I do use it against non-ASCII text, I know I'm creating technical
debt.

\d, I never use anymore. And I spend a lot of time explaining to people
that their use of \d in a regexp is actually wrong.


Abigail

Demerphq

unread,
Oct 28, 2009, 6:05:08 PM10/28/09
to Eric Brine, Perl5 Porters
2009/10/28 Eric Brine <ike...@adaelis.com>:

That is a really nice summation.

Juerd Waalboer

unread,
Oct 28, 2009, 6:50:50 AM10/28/09
to perl5-...@perl.org
Hi,

Yesterday I read in perl5110delta that \s, \w and \d would change to
ascii-only. I thought this was a bad idea for several distinct reasons,
and briefly discussed the change on IRC and Twitter, but not
perl5-porters, primarily because a discussion on this list often goes on
for a while. (To convince a group of highly knowledgeable people is
incredibly energy consuming; I'm not doing that anymore.)

Today I found perl5111delta; somehow I had failed to notice that the
changes were already reverted.

By the way, I asked in #perl6 about the direction for Perl 6, not
knowing that Larry would feel strongly about it, and not knowing he'd
invoke Rule 1. However, I'm glad that he did.

Now with "Rule 1" invoked and the changes already reverted, I feel
confident enough that I can post my thoughts here without the pressure
of having to convince anyone, or thinking I should.

For me, it's not just about embracing Unicode.


Ignoring Perl 5.11.0, there's a clear bug in Unicode capable Perl 5s
regarding string semantics. It took a while to reach consensus that this
is indeed a design bug, a broken abstraction: the semantics of several
operations are dependent on the internal representation of otherwise
indistinguishable values.

Specifically, several operations take only codepoints in the ASCII range
into account when the internal encoding of the operand string is not
UTF-8. This applies to built-in character classes and their shortcuts
and for functionality that deals with letter case (uc, lc, /i).

The historical behavior was to match only ASCII, but Unicode support was
added later. Most people remained ignorant about this change and its
implications. But anyone who is aware of the added Unicode support is
bitten by the hard to predict distinction between two sets of semantics.

Now, it's fairly simple to force Perl to use only the Unicode semantics.
Just utf8::upgrade the string and these operators and the regex engine
will behave predictably. And that's what I told people to do. Upgrade
your strings until Perl is fixed. I've told IRC, I've told YAPC and
workshop audiences, Perl Monks, and readers of the now-official
perlunifaq. I've heard others echo the advice and I've seen lots of
utf8::upgrade's in the wild.
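
The workaround described above can be sketched as follows. This is a minimal illustration, assuming default regex rules with no locale, no charset modifier, and the unicode_strings feature not enabled; the helper name has_word_char is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: does the string contain a \w character?
sub has_word_char { my ($s) = @_; return $s =~ /\w/ ? 1 : 0 }

my $native   = "\xC2";      # LATIN CAPITAL LETTER A WITH CIRCUMFLEX, one byte internally
my $upgraded = "\xC2";
utf8::upgrade($upgraded);   # same character, UTF-8 internal representation

# The two strings compare equal, yet \w treats them differently.
print "equal:    ", ($native eq $upgraded ? 1 : 0), "\n";
print "native:   ", has_word_char($native),   "\n";
print "upgraded: ", has_word_char($upgraded), "\n";
```

Under those assumptions the two strings are equal, the native one has no \w match, and the upgraded one does. Concatenating the native string with any string holding a code point above 255 upgrades it implicitly, which is why a match can appear "from nowhere".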

Throughout the years I have always assumed that Perl would be fixed by
abandoning the ASCII-only behaviour, embracing Unicode as the default as
this had been the direction ever since 5.6. This assumption is
reflected in much of my writings and several talks on the subject.

It became painfully clear that a fix could never be made fully backward
compatible (except if the fix was enabled conditionally), but the nice
thing about the utf8::upgrade workaround is that you can repair your
existing code in a way that will continue to work even after the bug is
fixed. You can add the workaround to your code now, upgrade to 5.12
later, and then wait a few years before removing the calls to
utf8::upgrade or just leave them there. Even if utf8::upgrade were ever
removed from Perl, it'd be trivial to make it a no-op. It is safe to
add utf8::upgrade and be fairly certain that your code will continue to
function as it does today. (Modulo only some property changes in the
Unicode spec; this causing real problems is very rare.)

That is, until 5.11.0 introduced an intentional regression. While ASCII-only
has worked well in the past, and may in specific circumstances even make
more sense in terms of performance and security, I've called going back
to this an insanely bad idea. Larry agreed, noting that it strays
from the path of gaining on Unicode and that it is poor Huffman coding.

But as I said, to me it is not just about embracing Unicode. It's also
about compatibility. I agree that this is one of the incredibly rare
occasions where it's acceptable and maybe even necessary to break
backward compatibility. Going to Unicode-only means that the breakage
can be controlled by programmers. They can add utf8::upgrade statements
before upgrading perl, to make their code forward compatible. It
provides a way to prepare for the hefty change, and many have already
gone through their codebases looking for places to add this workaround.

Going (back) to ASCII-only, however, would not provide such a clean
upgrade path. There is no way to make your 5.10 code forward compatible
with 5.11.0, regarding \[dws], because there is no way to disable
matching out-of-ASCII-range characters in Perl 5.10. So the only way to
ensure a clean upgrade is to go through your code removing all uses of
\d, \w, and \s that could possibly match Unicode characters. (Which
would be extra painful for everyone who had already gone through it to
add utf8::upgrade calls!)

(Granted, there is a way that forces ASCII-only semantics but it breaks
the whole flow of "receive, decode, process, encode, send": just encode
the string to UTF8 temporarily, do your match, and decode again.)

So I'm glad that 5.11.1 comes with renewed sanity, and I hope that
Larry's Rule 1 invocation will prevent the ASCII-only thing from
happening again. Perl 5.12 string semantics must be the same as Perl
5.10 semantics on utf8::upgrade'd strings; everything else should only
happen if explicitly requested by the programmer, preferably as
lexically local as possible.

In the past I have suggested adding a /a flag to the regular
expression engine. (Blissfully unaware of how hard this would be to
implement.) It would be useful for those cases (mostly sysadmin work)
where you want to match only ASCII characters. It'd have to be a flag
instead of a pragma, so it survives in qr, and so it can be negated in
a subregex. I still believe that such a flag could be useful. (But I
absolutely do not insist on having it.)
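
For readers on later Perls: a modifier along these lines shipped in Perl 5.14 as /a. The sketch below shows the behaviours asked for above (ASCII-only matching, survival inside qr//, and switching back inside a sub-pattern). It requires Perl 5.14 or newer, and it is an illustration of that later feature, not something available in the 5.10 or 5.11 releases under discussion.

```perl
use strict;
use warnings;

my $thai_one = "\x{0E51}";   # THAI DIGIT ONE

# Default rules on a string with a code point above 255: Unicode digits match.
my $default = $thai_one =~ /\A\d\z/  ? 1 : 0;

# /a restricts \d, \s, \w and the POSIX classes to the ASCII range.
my $ascii   = $thai_one =~ /\A\d\z/a ? 1 : 0;

# The modifier is stored in the qr// object and survives interpolation.
my $id_re   = qr/\d+/a;
my $in_qr   = "id=$thai_one" =~ /\Aid=$id_re\z/ ? 1 : 0;

# A sub-pattern can switch back to Unicode rules with (?u:...).
my $nested  = $thai_one =~ /\A(?u:\d)\z/a ? 1 : 0;

print "default=$default ascii=$ascii in_qr=$in_qr nested=$nested\n";
```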
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sa...@convolution.nl>

Dr.Ruud

unread,
Oct 29, 2009, 5:43:59 AM10/29/09
to perl5-...@perl.org
Mark Mielke wrote:

> I think chief justice as you call it
> should be an elected term position

Democracy has nothing to do with this, and we must make sure that it
never will. Consensus between involved people, and pro-activity; all
else needs mostly to be ignored.

--
Ruud (not involved much, so ignore)

Paul LeoNerd Evans

unread,
Oct 29, 2009, 8:46:30 AM10/29/09
to perl5-...@perl.org
On Wed, 28 Oct 2009 09:25:30 +0200
Yuval Kogman <nothi...@woobling.org> wrote:

> How difficult would it be to introduce special chars which aren't
> charclasses, which are probably more suitable for what people want anyway
> (things that agree with grok_number, with rules for natural numbers,
> integers, decimal fractions, and floating point notation)?
>
> Seems like the distinction between matching a character that is a digit vs.
> matching ascii digits is mostly about what you do with the numbers
> afterwords. Perhaps it's better to just remove the extra duplication?

Vim uses foo vs \_foo to distinguish whether a linefeed is included or
not; e.g.

abc. <= literal followed by anything except linefeed
abc\. <= literal followed by anything including linefeed

Maybe we can find some suitable mangling to apply to \w, \d, \s, etc...
to say "with extra Unicode chars like these"

--
Paul "LeoNerd" Evans

leo...@leonerd.org.uk
ICQ# 4135350 | Registered Linux# 179460
http://www.leonerd.org.uk/


Paul LeoNerd Evans

unread,
Oct 29, 2009, 9:01:41 AM10/29/09
to perl5-...@perl.org

If you're going to make \d match non-ASCII then please make this work

m/^\d+$/ and $count = $_+0;


demerphq

unread,
Oct 29, 2009, 9:21:02 AM10/29/09
to Paul LeoNerd Evans, perl5-...@perl.org
2009/10/29 Paul LeoNerd Evans <leo...@leonerd.org.uk>:

> On Wed, 28 Oct 2009 10:09:50 -0400
> Ricardo Signes <perl...@rjbs.manxome.org> wrote:
>
>> * demerphq <deme...@gmail.com> [2009-10-28T09:51:45]
>> > >
>> > > As for \d, though, I am horrified to think how much bad behavior could be
>> > > introduced if \d started to match TITLE CASE KLINGON NUMERAL CHORGH
>> > >
>> > > I think it is likely that I would not upgrade to a perl5 that introduced
>> > > such behavior.  "Review every regex that uses \d" is not an acceptable
>> > > burden.
>> >
>> > What you just described is the present situation. And many people have
>> > this bug and have done exactly what you said.
>>
>> I stand both corrected and astonished.
>
> If you're going to make \d match non-ASCII then please make this work
>
>  m/^\d+$/ and $count = $_+0;

It already matches non-ascii.

And no it doesnt work. And frankly the idea of making it work doesnt
make a lot of sense to me.

You really want "\x{0E50}\x{0ED0}" to be another way to write "11"?

They arent even in the same script.

Abigail

unread,
Oct 29, 2009, 10:14:20 AM10/29/09
to Paul LeoNerd Evans, perl5-...@perl.org
On Thu, Oct 29, 2009 at 12:46:30PM +0000, Paul LeoNerd Evans wrote:
> On Wed, 28 Oct 2009 09:25:30 +0200
> Yuval Kogman <nothi...@woobling.org> wrote:
>
> > How difficult would it be to introduce special chars which aren't
> > charclasses, which are probably more suitable for what people want anyway
> > (things that agree with grok_number, with rules for natural numbers,
> > integers, decimal fractions, and floating point notation)?
> >
> > Seems like the distinction between matching a character that is a digit vs.
> > matching ascii digits is mostly about what you do with the numbers
> > afterwords. Perhaps it's better to just remove the extra duplication?
>
> Vim uses foo vs \_foo to distinguish whether a linefeed is included or
> not; e.g.
>
> abc. <= literal followed by anything except linefeed
> abc\. <= literal followed by anything including linefeed
>
> Maybe we can find some suitable mangling to apply to \w, \d, \s, etc...
> to say "with extra Unicode chars like these"

\begin{not-really-serious}

I suggest \ḋ, \ṡ, and \ẇ for the Unicode character classes, and
\d, \s, \w for the ASCII versions.

For those not able to read my suggestions, it's

\N{LATIN SMALL LETTER D WITH DOT ABOVE} \x{1E0B}
\N{LATIN SMALL LETTER S WITH DOT ABOVE} \x{1E61}
\N{LATIN SMALL LETTER W WITH DOT ABOVE} \x{1E87}

\end{not-really-serious}

Abigail

Paul LeoNerd Evans

unread,
Oct 29, 2009, 11:14:30 AM10/29/09
to perl5-...@perl.org
On Thu, 29 Oct 2009 14:21:02 +0100
demerphq <deme...@gmail.com> wrote:

> And no it doesnt work. And frankly the idea of making it work doesnt
> make a lot of sense to me.
>
> You really want "\x{0E50}\x{0ED0}" to be another way to write "11"?
>
> They arent even in the same script.

No, I don't. That was meant to be an appeal to absurdity to suggest
"don't do this" :)

> It already matches non-ascii.

Ah. Then that's unfortunate, as now we can't use $1 numerically after
capturing it with (\d+), and know it'll work. This is what I was getting
at..
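
A small sketch of the problem described above: a capture made with \d+ is not guaranteed to numify. The helper name capture_count is hypothetical, and the numeric warning is silenced only so the example runs quietly.

```perl
use strict;
use warnings;

# Hypothetical helper: capture a run of \d and use it as a number.
sub capture_count {
    my ($s) = @_;
    return unless $s =~ /\A(\d+)\z/;   # no match: return undef in scalar context
    no warnings 'numeric';             # Perl would warn "isn't numeric" here
    return $1 + 0;
}

my $ascii = capture_count("12");
my $thai  = capture_count("\x{0E51}\x{0E52}");   # THAI DIGIT ONE, THAI DIGIT TWO

print "ascii: $ascii\n";
print "thai:  $thai\n";
```

Both strings match, but only the ASCII one yields 12; the Thai capture numifies to 0, because Perl's string-to-number conversion only understands ASCII digits. Matching with [0-9] instead avoids the mismatch.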


Paul LeoNerd Evans

unread,
Oct 29, 2009, 11:18:14 AM10/29/09
to perl5-...@perl.org
On Thu, 29 Oct 2009 12:46:30 +0000
Paul LeoNerd Evans <leo...@leonerd.org.uk> wrote:

> On Wed, 28 Oct 2009 09:25:30 +0200
> Yuval Kogman <nothi...@woobling.org> wrote:
>
> > How difficult would it be to introduce special chars which aren't
> > charclasses, which are probably more suitable for what people want anyway
> > (things that agree with grok_number, with rules for natural numbers,
> > integers, decimal fractions, and floating point notation)?
> >
> > Seems like the distinction between matching a character that is a digit vs.
> > matching ascii digits is mostly about what you do with the numbers
> > afterwords. Perhaps it's better to just remove the extra duplication?
>
> Vim uses foo vs \_foo to distinguish whether a linefeed is included or
> not; e.g.
>
> abc. <= literal followed by anything except linefeed
> abc\. <= literal followed by anything including linefeed

Sorry; I meant

abc\_.


Abigail

unread,
Oct 29, 2009, 12:04:07 PM10/29/09
to demerphq, Paul LeoNerd Evans, perl5-...@perl.org


Well, one of the reasons for \w to match more than ASCII characters
(first with locale, later with Unicode) was that it should be possible
to process 'words' in foreign scripts as well.

If we want '\w+' to be able to match Klingon "words", why shouldn't \d+
match Klingon numbers? Yes, \d+ matches digits from different scripts,
but \w+ matches word characters from different scripts as well.

Now, don't consider this an argument in favour of having \w match non-ASCII
characters - but, IMO, if \w can match non-ASCII characters, so should \d.

Abigail

Paul LeoNerd Evans

Oct 29, 2009, 12:24:53 PM10/29/09
to perl5-...@perl.org
On Thu, 29 Oct 2009 17:04:07 +0100
Abigail <abi...@abigail.be> wrote:

> Now, don't consider this an argument in favour of having \w match non-ASCII
> characters - but, IMO, if \w can match non-ASCII characters, so should \d.

This would seem to make the most sense, and be the most predictable.
Either all of them match Unicode, or none of them do.

If none of them do, then adding Unicode variations might be a nice idea.

I would suggest

                  word   digit  space
ASCII-only        \w     \d     \s
Includes Unicode  \Uw    \Ud    \Us

Only \U is already used. And \u.

Do we have a definitive list anywhere, on a tangential note, of the
remaining unused \x letters?


David Nicol

Oct 29, 2009, 12:26:43 PM10/29/09
to perl5-...@perl.org
On Thu, Oct 29, 2009 at 11:04 AM, Abigail <abi...@abigail.be> wrote:

> Now, don't consider this an argument in favour of having \w match non-ASCII
> characters - but, IMO, if \w can match non-ASCII characters, so should \d.

The constraint that anything that matches /\d+/ should numify to the
described number is a reasonable expectation. The alternative to
disallowing [^0-9] from \d is expanding numification to include
alternatives. Creating a utof function that knows the values of all
the digitty characters from all scripts would require two steps
besides compiling the list. The first step is a big discussion to
decide how to handle the edge cases, including but not limited to
expressions from mixed scripts, expressions from non-base-ten languages
(why I used cuneiform yesterday), expressions mixing scripts that mix
conventions, and ambiguous expressions.

Should attempting to numify Ⅵ0 produce six or sixty or zero and a
warning or throw an exception but only under a new pragma and if so
how should that pragma be enabled, either via strict or autodie?

Should Ⅵ be expressible as ⅤⅠ?

If the base-ten atoi algorithm is simply applied based on Unicode
numeric values, for instance, Ⅵ would be six but ⅤⅠ would be
fifty-one (5 * 10 + 1).
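
A minimal sketch of that naive positional rule, for illustration only. The lookup table and the name naive_atoi are hypothetical; the values are hard-coded here rather than read from the Unicode database, and nothing like this exists in core perl.

```perl
use strict;
use warnings;

# Hypothetical table of numeric values for a few code points,
# hard-coded for illustration.
my %NUMERIC_VALUE = (
    "\x{2160}" => 1,   # ROMAN NUMERAL ONE
    "\x{2164}" => 5,   # ROMAN NUMERAL FIVE
    "\x{2165}" => 6,   # ROMAN NUMERAL SIX
    "0"        => 0,   # DIGIT ZERO
);

# Treat every character as one base-ten position, whatever its script.
# Returns undef if a character has no entry in the table.
sub naive_atoi {
    my ($string) = @_;
    my $total = 0;
    for my $char (split //, $string) {
        return undef unless exists $NUMERIC_VALUE{$char};
        $total = $total * 10 + $NUMERIC_VALUE{$char};
    }
    return $total;
}

printf "%d\n", naive_atoi("\x{2165}");          # single character with value 6
printf "%d\n", naive_atoi("\x{2164}\x{2160}");  # 5 * 10 + 1
printf "%d\n", naive_atoi("\x{2165}0");         # 6 * 10 + 0
```

Under this rule the same quantity gets different values depending on how it is spelled, which is the kind of edge case the first step would have to settle.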

The second step, possible to do simultaneously with the ongoing first
step (the continuing proceedings of the working group on unicode to
numeric conversions in Perl) is implementing the decisions.

Paul LeoNerd Evans

Oct 29, 2009, 12:39:39 PM10/29/09
to perl5-...@perl.org
On Thu, 29 Oct 2009 16:24:53 +0000

Paul LeoNerd Evans <leo...@leonerd.org.uk> wrote:

> I would suggest
>
> word digit space
> ASCII-only \w \d \s
> Includes Unicode \Uw \Ud \Us
>
> Only \U is already used. And \u.

Actually then, on that note could we consider some more modifiers?

ASCII:    m/\w/a  m/\d/a  m/\s/a
Unicode:  m/\w/u  m/\d/u  m/\s/u

and if neither is specified keep to the existing behaviour..
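
Modifiers with these spellings did not exist when this was written; they appear in later perls (5.14 onward). The sketch below assumes such a perl and only illustrates the proposed split. It does not describe the behaviour of any perl that was current in this thread.

```perl
use v5.14;      # assumes a perl that has the /a and /u regex modifiers
use warnings;

my $thai_one = "\x{0E51}";   # THAI DIGIT ONE

# /a restricts \d to ASCII 0-9; /u asks for Unicode rules.
my $ascii_only = ($thai_one =~ /\A\d\z/a) ? 1 : 0;
my $unicode    = ($thai_one =~ /\A\d\z/u) ? 1 : 0;
my $ascii_one  = ("1"       =~ /\A\d\z/a) ? 1 : 0;

say "THAI DIGIT ONE under /a: $ascii_only";
say "THAI DIGIT ONE under /u: $unicode";
say "ASCII 1 under /a:        $ascii_one";
```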


Abigail

Oct 29, 2009, 12:40:25 PM10/29/09
to Paul LeoNerd Evans, perl5-...@perl.org
On Thu, Oct 29, 2009 at 04:24:53PM +0000, Paul LeoNerd Evans wrote:
> On Thu, 29 Oct 2009 17:04:07 +0100
> Abigail <abi...@abigail.be> wrote:
>
> > Now, don't consider this an argument in favour of having \w match non-ASCII
> > characters - but, IMO, if \w can match non-ASCII characters, so should \d.
>
> This would seem to make the most sense, and be the most predictable.
> Either all of them match Unicode, or none of them do.
>
> If none of them do, then adding Unicode variations might be a nice idea.
>
> I would suggest
>
> word digit space
> ASCII-only \w \d \s
> Includes Unicode \Uw \Ud \Us
>
> Only \U is already used. And \u.
>
> Do we have a definitive list anywhere, on a tangential note, of the
> remaining unused \x letters?

25% is still available (13 out of 52 upper and lower case ASCII characters).

perlrebackslash.pod lists all \x letters in use. It's easy to deduce
the unused ones: \F, \i, \I, \j, \J, \m, \M, \o, \O, \q, \T, \y, \Y.

\c, \g, \k, \p, \P, \x are "partially available", that is, currently they
can only be followed by a limited set of characters, so there's some
room for expansion left.

\N is partially available in 5.10.x, but taken in blead.


Abigail

Abigail

Oct 29, 2009, 12:46:55 PM10/29/09
to David Nicol, perl5-...@perl.org
On Thu, Oct 29, 2009 at 11:26:43AM -0500, David Nicol wrote:
> On Thu, Oct 29, 2009 at 11:04 AM, Abigail <abi...@abigail.be> wrote:
>
> > Now, don't consider this an argument in favour of having \w match non-ASCII
> > characters - but, IMO, if \w can match non-ASCII characters, so should \d.
>
> the constraint that anything that matches /\d+/ should numify to the
> described number is a reasonable expectation.

OTOH, not everything that numifies matches /\d+/.

> The alternative to
> disallowing [^0-9] from \d is expanding numification to include
> alternatives. Creating a utof function that knows the values of all
> the digitty characters from all scripts would require two steps
> besides compiling the list. The first step is a big discussion to
> decide how to handle the edge cases, including but not limited to
> expressions from mixed scripts, expressions from non-base-ten languages
> (why I used cuneiform yesterday), expressions mixing scripts that mix
> conventions, and ambiguous expressions.
>
> Should attempting to numify Ⅵ0 produce six or sixty or zero and a
> warning or throw an exception but only under a new pragma and if so
> how should that pragma be enabled, either via strict or autodie?

Roman numerals are classified as "Number Letter" in the Unicode database,
and hence don't match /\d+/, which matches *digits*.

Abigail

Abigail

Oct 29, 2009, 12:52:08 PM10/29/09
to Graham Barr, Paul LeoNerd Evans, perl5-...@perl.org
On Thu, Oct 29, 2009 at 11:45:40AM -0500, Graham Barr wrote:

>
> On Oct 29, 2009, at 11:39 AM, Paul LeoNerd Evans wrote:
>
>> On Thu, 29 Oct 2009 16:24:53 +0000
>> Paul LeoNerd Evans <leo...@leonerd.org.uk> wrote:
>>
>>> I would suggest
>>>
>>> word digit space
>>> ASCII-only \w \d \s
>>> Includes Unicode \Uw \Ud \Us
>>>
>>> Only \U is already used. And \u.
>>
>> Actually then, on that note could we consider some more modifiers?
>>
>> ASCII: m/\w/a m/\d/a m/\s/a
>> Unicode: m/\w/u m/\d/u m/\s/u
>>
>> and if neither is specified keep to the existing behaviour..
>
> I was just having a similar thought. Although I do not think we would
> need both as it should be one or the other. With Unicode as the default
> we would only need /a. With this qr// patterns would also carry it with
> them


And then the 'short hand' for "[0-9]" becomes "(?a:\d)". ;-)

/(?u-a:\d)/: match any non-ASCII digit?

Abigail

Karl Williamson

Oct 29, 2009, 1:01:35 PM10/29/09
to David Nicol, perl5-...@perl.org
Non-unihan Unicode has three types of numbers. Decimal digit, other
digit, and other numeric. \d only matches decimal digits, which is the
"right" thing in my mind. "other digits" are like superscript 1. I
think it is a reasonable argument to make that \d shouldn't match
anything that you can't numify automatically; I don't think it is a good
idea to have it match superscripts nor roman numerals, nor fractions.

We could extend numification to handle any or all Unicode code points
that have a numeric value. But I don't think \d should match anything
more than decimal digits. There is a CJK ideograph that means 10**12.
People are expecting \d to match a single digit.
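
A small sketch of that distinction, using the general-category properties Nd (decimal digit), Nl (letter number) and No (other number). The sample code points and the helper name classify are chosen here for illustration, and a reasonably modern perl is assumed.

```perl
use strict;
use warnings;

# Sample characters, one per kind of "number" mentioned above.
my %sample = (
    'THAI DIGIT ONE'           => "\x{0E51}",  # decimal digit
    'SUPERSCRIPT ONE'          => "\x{00B9}",  # an "other digit"
    'ROMAN NUMERAL FOUR'       => "\x{2163}",  # letter number
    'VULGAR FRACTION ONE HALF' => "\x{00BD}",  # other numeric
);

# Report the Unicode general category of a single numeric character.
sub classify {
    my ($char) = @_;
    return 'Nd' if $char =~ /\A\p{Nd}\z/;
    return 'Nl' if $char =~ /\A\p{Nl}\z/;
    return 'No' if $char =~ /\A\p{No}\z/;
    return 'not a number';
}

for my $name (sort keys %sample) {
    my $char    = $sample{$name};
    my $matches = ($char =~ /\A\d\z/) ? 'yes' : 'no';
    printf "%-26s category=%-3s matched by \\d: %s\n",
        $name, classify($char), $matches;
}
```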

David Nicol

Oct 29, 2009, 2:36:14 PM10/29/09
to karl williamson, perl5-...@perl.org
On Thu, Oct 29, 2009 at 12:01 PM, karl williamson

> Non-unihan Unicode has three types of numbers.  Decimal digit, other digit,
> and other numeric. \d only matches decimal digits, which is the "right"
> thing in my mind.  "other digits" are like superscript 1.  I think it is a
> reasonable argument to make that \d shouldn't match anything that you can't
> numify automatically; I don't think it is a good idea to have it match
> superscripts nor roman numerals, nor fractions.

Oh. I left out the possibility of having nonsensical mixed expressions
return NaN after they warn, in case someone trying to implement
Text::Numeric::Any reads this thread some day.

> We could extend numification to handle any or all Unicode code points that
> have a numeric value.  But I don't think \d should match anything more than
> decimal digits.  There is a CJK ideograph that means 10**12. People are
> expecting \d to match a single digit.

So if \d only means [0-9] plus various other kinds of [0-9] in
different writing systems, and doesn't include, for instance, [①-⒛],
without changing the semantics any, utoi and utof pretty much write
themselves. What's the range of characters permissible for the point
in floating point and should \. match them too, if there are more?

Karl Williamson

Oct 29, 2009, 3:06:46 PM10/29/09
to David Nicol, perl5-...@perl.org
The decimal point is locale dependent and not specified in the Unicode
standard, but they have a CLDR (Common Locale Data Repository) project
that I believe contains that info.

Tom Christiansen

Oct 29, 2009, 3:11:46 PM10/29/09
to karl williamson, David Nicol, perl5-...@perl.org, Abigail
Karl wrote:

> Non-unihan Unicode has three types of numbers. Decimal digit, other
> digit, and other numeric. \d only matches decimal digits, which is the
> "right" thing in my mind. "other digits" are like superscript 1. I
> think it is a reasonable argument to make that \d shouldn't match
> anything that you can't numify automatically; I don't think it is a good
> idea to have it match superscripts nor roman numerals, nor fractions.

> We could extend numification to handle any or all Unicode code points
> that have a numeric value. But I don't think \d should match anything
> more than decimal digits. There is a CJK ideograph that means 10**12.
> People are expecting \d to match a single digit.

There may be issues even with just that. What's a "digit",
really? Some code points that by name call themselves DIGITs
are \d, but some calling themselves DIGITs aren't--they're \D,
like all the counting-rod digits:

\D U+1d360 COUNTING ROD UNIT DIGIT ONE
\D U+1d369 COUNTING ROD TENS DIGIT ONE

True, those seem no great loss to ignore, but I'm not so sure
all others are. Some scripts' DIGITs are actually mixed \d
and \D, like:

\d U+00f21 TIBETAN DIGIT ONE
\D U+00f2a TIBETAN DIGIT HALF ONE

\d U+00c67 TELUGU DIGIT ONE
\D U+00c79 TELUGU FRACTION DIGIT ONE FOR ODD POWERS OF FOUR

I imagine, but do not know, that numbers in those scripts might
be composed of a mixture of \d and \D DIGITs.

Ethiopic DIGITs are *all* \D, and read left to right:

\D L U+01369 ETHIOPIC DIGIT ONE

Whereas all Kharoshthi DIGITs are also \D, but read right to left:

\D R U+10a40 KHAROSHTHI DIGIT ONE

You'd think it'd be less confusing just within say, European
Numbers, but it isn't that obviously so. Our "1", the Arabic
numeral digit one, is of \p{BidiClass:EN}, a European Number
not an Arabic one:

\d EN U+00031 DIGIT ONE

While Arabic-Indic's digit one is instead \p{BidiClass:AN}, an
Arabic numeral that indeed counts as an Arabic Number:

\d AN U+00661 ARABIC-INDIC DIGIT ONE

Yet the *extended* Arabic-Indic's digit one has swapped back to
being a European Number again:

\d EN U+006f1 EXTENDED ARABIC-INDIC DIGIT ONE

If automagic atod()ish nummification [hmm: or numefication?*]
for digit-strings is the goal, I don't know how to handle digit
strings composed of digits from various scripts. Besides the \d
\D issues above, even when restricted to \d digits, directional
concerns remain.

Most go left to right:

\d L U+009e7 BENGALI DIGIT ONE
\d L U+00e51 THAI DIGIT ONE
\d L U+00f21 TIBETAN DIGIT ONE

But Nko digits, which are real \d digits, are written right to left:

\d R U+007c1 NKO DIGIT ONE

I think it's probably prudent to avoid most or all non-\d
numbers, like subscripts and superscripts, even when those
count as European Numbers:

\D EN U+000b9 SUPERSCRIPT ONE
\D EN U+02081 SUBSCRIPT ONE
\D EN U+02488 DIGIT ONE FULL STOP
\D ON U+02460 CIRCLED DIGIT ONE
\D ON U+0278a DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ONE

If somebody is asked to enter the year like for copyrights, then
2009 of course works perfectly fine, and so do some other scripts.
It seems too specific to the problem domain, not to mention just
plain dangerous, to try to cope with Roman numerals, whether
entered as MMIX or in the Unicode versions.

Other writing systems than Roman/Latin also use their letters
as numeric digits. If the current Hebrew year is 2009 + 3760
= 5769, they'd then write that--from right to left--as 769,
dropping the 5000 by convention, this way:

\D R U+005ea HEBREW LETTER TAV (number 400)
\D R U+005e9 HEBREW LETTER SHIN (number 300)
\D R U+005e1 HEBREW LETTER SAMEKH (number 60)
\D R U+005d8 HEBREW LETTER TET (number 9)

Actually, order doesn't really matter: it's not positional, just
accumulative. Since we'd boot the Romans, I guess we'd boot the
Hebrews, too; don't see any way around it, really.
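
A sketch of that accumulative reading, using only the four letter values listed above. The table and the name additive_value are hypothetical, and no claim is made about how Hebrew numerals are handled by any real library.

```perl
use strict;
use warnings;

# Letter values as listed above; illustration only.
my %LETTER_VALUE = (
    "\x{05EA}" => 400,  # HEBREW LETTER TAV
    "\x{05E9}" => 300,  # HEBREW LETTER SHIN
    "\x{05E1}" => 60,   # HEBREW LETTER SAMEKH
    "\x{05D8}" => 9,    # HEBREW LETTER TET
);

# Sum the values of all letters; position plays no part.
# Returns undef if a character has no entry in the table.
sub additive_value {
    my ($string) = @_;
    my $total = 0;
    for my $char (split //, $string) {
        return undef unless exists $LETTER_VALUE{$char};
        $total += $LETTER_VALUE{$char};
    }
    return $total;
}

my $year     = "\x{05EA}\x{05E9}\x{05E1}\x{05D8}";
my $reversed = scalar reverse $year;

print additive_value($year),     "\n";   # 400 + 300 + 60 + 9
print additive_value($reversed), "\n";   # same letters, opposite order
```

Both calls give the same sum, so a positional atoi is the wrong model for such strings.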

Just a few conundra I've been tossing around.

--tom
--

* numinal: divine.

numinous: of or pertaining to a numen; divine, spiritual,
revealing or suggesting the presence of a god; inspiring
awe and reverence. Hence numinosity, numinousness, numinously.

numify: to apotheosize.

Mark Mielke

Oct 29, 2009, 3:18:03 PM10/29/09
to Juerd Waalboer, perl5-...@perl.org
On 10/28/2009 06:50 AM, Juerd Waalboer wrote:
> The historical behavior was to match only ASCII, but Unicode support was
> added later. Most people remained ignorant about this change and its
> implications. But anyone who is aware of the added Unicode support is
> bitten by the hard to predict distinction between two sets of semantics.
>
> Now, it's fairly simple to force Perl to use only the Unicode semantics.
> Just utf8::upgrade the string and these operators and the regex engine
> will behave predictably. And that's what I told people to do. Upgrade
> your strings until Perl is fixed. I've told IRC, I've told YAPC and
> workshop audiences, Perl Monks, and readers of the now-official
> perlunifaq. I've heard others echo the advice and I've seen lots of
> utf8::upgrade's in the wild.
>
> Throughout the years I have always assumed that Perl would be fixed by
> abandoning the ASCII-only behaviour, embracing Unicode as the default as
> this had been the direction ever since 5.6. This assumption is
> reflected in much of my writings and several talks on the subject.
>

Lots of good points about the breakage of utf-8 vs non-utf-8. This
should be fixed, but it's not really related to concerns I have about
\d. I think you and I agree that users should not need to know whether
a string is utf-8 or non-utf-8. All Perl operations should, if at all
possible, operate the same, no matter what internal form it happens to
use to encode the string. It is a leaky abstraction for every exception
to this rule. It's bad, and it has prevented Perl from successfully
reaching the state of "embracing Unicode."

My concerns for \d are well covered by other people, even in posts today.
I have LOTS of code that uses \d written over the last 20 years, and it
is very concerning to me that this code which uses \d as a guard against
invalid input, may be accepting Unicode characters which cause my
programs to break, cause my programs to be exploitable over the network,
or cause data corruption.

I have a lot of code that does something like:

$number = /\A\d+\z/ ? 0+$_ :
die "...";

It scares me that this code may now be broken, or may become broken.

Was I wrong to use \d+? What should I have used? I've been taught to
avoid [0-9] in all languages since before I learned Perl, due to silly
character sets like EBCDIC. Now it seems like my only choice is:

$number = /\A[0123456789]+\z/ ? 0 + $_ :
die "...";

To me, that's just insanity.
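
A minimal sketch of the worry, assuming a perl where \d follows Unicode rules on strings like this one. The helper name to_number_ascii is hypothetical. As far as the EBCDIC concern goes, the ten digits are contiguous in EBCDIC as well as in ASCII, so [0-9] should not suffer from the gaps that letter ranges have there.

```perl
use strict;
use warnings;

# ASCII-only guard: returns the number, or undef for anything else.
sub to_number_ascii {
    my ($input) = @_;
    return $input =~ /\A[0-9]+\z/ ? 0 + $input : undef;
}

my $thai = "\x{0E51}\x{0E52}";   # THAI DIGIT ONE, THAI DIGIT TWO

# The \d guard lets the string through ...
my $passes_backslash_d = ($thai =~ /\A\d+\z/) ? 1 : 0;

# ... but numification only understands ASCII digits.
my $numified = do { no warnings 'numeric'; 0 + $thai };

print "passes /\\A\\d+\\z/: $passes_backslash_d\n";
print "numifies to:       $numified\n";
print "ASCII guard:       ",
    (defined to_number_ascii($thai) ? "accepted" : "rejected"), "\n";
```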

I don't think \d should match anything that would allow /\A\d+\z/ to
result in a value where ("$_" != 0+$_). Go ahead and make 0+$_ better,
or go ahead and make \d match only ASCII '0' through '9' - but anything
that causes this "identity" break is a BAD decision. Heck, even go ahead
and make a BAD decision - but the result will be that I recommend
against using Perl. I trust my programming language to do what I tell it
to. If it starts doing stupid things with each future release - I will
stop trusting it, and I will stop using it.

Cheers,
mark

--
Mark Mielke <ma...@mielke.cc>

Karl Williamson

Oct 29, 2009, 3:48:42 PM10/29/09
to Mark Mielke, Juerd Waalboer, perl5-...@perl.org

Here is a reference about Unicode security issues:
http://unicode.org/reports/tr36/

And here is an excerpt from that:

Turning away from the focus on domain names for a moment, there is
another area where visual spoofs can be used. Many scripts have sets of
decimal digits that are different in shape from the typical European
digits {0 1 2 3 4 5 6 7 8 9}. For example, Bengali has {০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯}, while
Oriya has {୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯}. While the sets taken as a whole are
different in shape, individual digits may have the same shapes as digits
from other scripts, even digits of different values. For example, the
string ৪୨ is visually confusable with 89 (at small sizes), but actually
has the numeric value 42. Where software interprets the numeric value of
a string of digits without detecting that the digits are from different
scripts, it is possible to generate such spoofs.
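
A sketch of detecting that kind of mixed-script digit string. It assumes a perl whose Unicode::UCD module provides num() (5.14 or later, as far as I know), which is documented to return undef when the digits in a string do not all come from the same script.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);   # assumes a Unicode::UCD that exports num()

my $bengali = "\x{09EA}\x{09E8}";   # BENGALI DIGIT FOUR, BENGALI DIGIT TWO
my $mixed   = "\x{09EA}\x{0B68}";   # BENGALI DIGIT FOUR, ORIYA DIGIT TWO

# A plain \d+ guard accepts the mixed string.
my $mixed_passes_d = ($mixed =~ /\A\d+\z/) ? 1 : 0;

# num() gives a value only for the single-script string.
my $bengali_value = num($bengali);
my $mixed_value   = num($mixed);

print "mixed passes \\d+: $mixed_passes_d\n";
print "Bengali value:    ", (defined $bengali_value ? $bengali_value : 'undef'), "\n";
print "mixed value:      ", (defined $mixed_value   ? $mixed_value   : 'undef'), "\n";
```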

John

Oct 29, 2009, 4:11:13 PM10/29/09
to Abigail, demerphq, Paul LeoNerd Evans, perl5-...@perl.org
Abigail wrote:
> If we want '\w+' to be able to match Klingon "words", why shouldn't \d+
> match Klingon numbers? Yes, \d+ matches digits from different scripts,
> but \w+ matches word characters from different scripts as well.
>
> Now, don't consider this an argument in favour of having \w match non-ASCII
> characters - but, IMO, if \w can match non-ASCII characters, so should \d.
>
>
>
> Abigail
>
>
The Unicode character database file UnicodeData.txt contains in fields
6, 7 and 8 the value of numeric characters. Could we not use that to
numify characters such as Ⅲ and Ⅸ, so that Ⅸ - Ⅳ == 5?

John


John

Oct 29, 2009, 4:19:42 PM10/29/09
to Abigail, David Nicol, perl5-...@perl.org
Abigail wrote:
> Roman numerals are classified as "Number Letter" in the Unicode database,
> and hence don't match /\d+/, which matches *digits*.
>
>
>
> Abigail
>
>
>

But Character.getNumericValue('Ⅳ') == 4 in Java, so there is a
numeric mapping for the Roman numerals.

See http://www.fileformat.info/info/unicode/char/2163/index.htm

John

Oct 29, 2009, 4:25:17 PM10/29/09
to karl williamson, David Nicol, perl5-...@perl.org
karl williamson wrote:
> The decimal point is locale dependent and not specified in the Unicode
> standard, but they have a CLDR (Common Locale Data Repository) project
> that I believe contains that info.
>
I'm currently trying to create a Perl library that will handle CLDR data;
see http://github.com/ThePilgrim/perlcldr for more.

Karl Williamson

Oct 29, 2009, 4:34:58 PM10/29/09
to Tom Christiansen, David Nicol, perl5-...@perl.org, Abigail, Juerd Waalboer
Tom Christiansen wrote:
> Karl wrote:
>
>> Non-unihan Unicode has three types of numbers. Decimal digit, other
>> digit, and other numeric. \d only matches decimal digits, which is the
>> "right" thing in my mind. "other digits" are like superscript 1. I
>> think it is a reasonable argument to make that \d shouldn't match
>> anything that you can't numify automatically; I don't think it is a good
>> idea to have it match superscripts nor roman numerals, nor fractions.
>
>> We could extend numification to handle any or all Unicode code points
>> that have a numeric value. But I don't think \d should match anything
>> more than decimal digits. There is a CJK ideograph that means 10**12.
>> People are expecting \d to match a single digit.
>
> There may be issues even with just that. What's a "digit",
> really? Some code points that by name call themselves DIGITs
> are \d, but some calling themselves DIGITs aren't--they're \D,
> like all the counting-rod digits:
>
> \D U+1d360 COUNTING ROD UNIT DIGIT ONE
> \D U+1d369 COUNTING ROD TENS DIGIT ONE
>
> True, those seem no great loss to ignore, but I'm not so sure
> all others are. Some scripts' DIGITs are actually mixed \d
> and \D, like:

If we decide to continue to allow \d to match non-ASCII, and I'm not
advocating that, it should only match decimal digits, regardless of the
names of the characters.


>
> \d U+00f21 TIBETAN DIGIT ONE
> \D U+00f2a TIBETAN DIGIT HALF ONE
>

Here is an example of a poor choice of name, or at least one that is
misunderstood by people. The term DIGIT in Unicode means only that it
is a single character that has a numeric meaning, much like we refer to
the ASCII F as a hexadecimal digit. So a DIGIT in Unicode doesn't have
to mean 0 through 9.

In this particular case, the HALF ONE means that this is 1 - .5 = .5,
which is the numeric value of the character. So it is a digit that
means a non-integral number. I think it should have been named ONE
MINUS HALF. There is a HALF ZERO whose value is -.5. I got curious a
while back as to why Tibetan of all languages in the world would have a
single character encoding the concept of -.5, so I looked it up on the
internet. IIRC, the claim was that all these half digits are based on a
single Tibetan postage stamp of one of the values, and that the others
were inferred artificially based on Tibetan grammatical rules, and there
is no concrete evidence that they really ever existed except for one of
them on that one stamp. There's a picture of it on the internet.

> \d U+00c67 TELUGU DIGIT ONE
> \D U+00c79 TELUGU FRACTION DIGIT ONE FOR ODD POWERS OF FOUR
>
> I imagine, but do not know, that numbers in those scripts might
> be composed of a mixture of \d and \D DIGITs.

U+0C79 also has numeric value one, but it appears to be more like a
remainder after taking a number mod 4, so I doubt that it stands on its own.

It seems we may be presuming a base 10 system inappropriately at times.


>
> Ethiopic DIGITs are *all* \D, and read left to right:
>
> \D L U+01369 ETHIOPIC DIGIT ONE
>
> Whereas all Kharoshthi DIGITs are also \D, but read right to left:
>
> \D R U+10a40 KHAROSHTHI DIGIT ONE
>
> You'd think it'd be less confusing just within say, European
> Numbers, but it isn't that obviously so. Our "1", the Arabic
> numeral digit one, is of \p{BidiClass:EN}, a European Number
> not an Arabic one:
>
> \d EN U+00031 DIGIT ONE
>
> While Arabic-Indic's digit one is instead \p{BidiClass:AN}, an
> Arabic numeral that indeed counts as an Arabic Number:
>
> \d AN U+00661 ARABIC-INDIC DIGIT ONE
>
> Yet the *extended* Arabic-Indic's digit one has swapped back to
> being a European Number again:
>
> \d EN U+006f1 EXTENDED ARABIC-INDIC DIGIT ONE
>

These classifications of European/Arabic number are solely for
implementing the Unicode Bidirectional Algorithm, and don't mean
anything beyond that. Again, using the decimal digit type would be the
way to go, if we continue to go there.

> If automagic atod()ish nummification [hmm: or numefication?*]
> for digit-strings is the goal, I don't know how to handle digit
> strings composed of digits from various scripts. Besides the \d
> \D issues above, even when restricted to \d digits, directional
> concerns remain.
>
> Most go left to right:
>
> \d L U+009e7 BENGALI DIGIT ONE
> \d L U+00e51 THAI DIGIT ONE
> \d L U+00f21 TIBETAN DIGIT ONE
>
> But Nko digits, which are real \d digits, are written right to left:
>
> \d R U+007c1 NKO DIGIT ONE

These are some of the reasons I'm leery of allowing \d to match outside
ASCII by default. I don't get the symmetry argument of Abigail's. A
number of people have responded recently about how they're surprised at
how it really works now. Much code has been written that assumes that
\d is [0-9], and that digits are part of a base 10 number written
left-to-right. I'm not sure right now if there is a way for a program
that doesn't use Encode or the command line options or I/O layers to get
Unicode data unexpectedly. But even if they are, Unicode is a large
beast, and someone may be getting more than they bargained for.
>
