Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]
More options Oct 28 2009, 6:50 am
Newsgroups: perl.perl5.porters
From: ju...@convolution.nl (Juerd Waalboer)
Date: Wed, 28 Oct 2009 11:50:50 +0100
Local: Wed, Oct 28 2009 6:50 am
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]
Hi,

Yesterday I read in perl5110delta that \s, \w and \d would change to
ascii-only. I thought this was a bad idea for several distinct reasons,
and briefly discussed the change on IRC and Twitter, but not
perl5-porters, primarily because a discussion on this list often goes on
for a while. (To convince a group of highly knowledgeable people is
incredibly energy consuming; I'm not doing that anymore.)

Today I found perl5111delta; somehow I had failed to notice that the

By the way, I asked in #perl6 about the direction for Perl 6, not
knowing that Larry would feel strongly about it, and not knowing he'd
invoke Rule 1. However, I'm glad that he did.

Now with "Rule 1" invoked and the changes already reverted, I feel
confident enough that I can post my thoughts here without the pressure
of having to convince anyone, or thinking I should.

For me, it's not just about embracing Unicode.

Ignoring Perl 5.11.0, there's a clear bug in Unicode-capable Perl 5s
regarding string semantics. It took a while to reach consensus that this
is indeed a design bug, a broken abstraction: the semantics of several
operations are dependent on the internal representation of otherwise
indistinguishable values.

Specifically, several operations take only codepoints in the ASCII range
into account when the internal encoding of the operand string is not
UTF-8. This applies to built-in character classes and their shortcuts
and to functionality that deals with letter case (uc, lc, /i).
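
A minimal sketch of the representation dependence described above, assuming a perl where neither "use utf8" nor the unicode_strings feature is in effect (the variable names are only for illustration). The two strings compare equal, yet \w and uc are expected to treat them differently: the byte-stored copy gets ASCII-only rules, the upgraded copy gets Unicode rules.

```perl
use strict;
use warnings;

# Deliberately no "use utf8" and no "use feature 'unicode_strings'":
# the point is the default behaviour.
my $native   = "\xe9";      # e with acute, held as a single byte internally
my $upgraded = $native;
utf8::upgrade($upgraded);   # same character, UTF-8 internal representation

my $same          = $native eq $upgraded    ? 1 : 0;  # indistinguishable as values
my $native_word   = $native   =~ /\w/       ? 1 : 0;  # ASCII-only rules apply
my $upgraded_word = $upgraded =~ /\w/       ? 1 : 0;  # Unicode rules apply
my $native_uc     = uc($native)   eq "\xc9" ? 1 : 0;  # was the letter uppercased?
my $upgraded_uc   = uc($upgraded) eq "\xc9" ? 1 : 0;

printf "same=%d  \\w: native=%d upgraded=%d  uc: native=%d upgraded=%d\n",
    $same, $native_word, $upgraded_word, $native_uc, $upgraded_uc;
```

Calling utf8::upgrade on every string before such operations is the workaround this message goes on to describe.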

The historical behavior was to match only ASCII, but Unicode support was
implications. But anyone who is aware of the added Unicode support is
bitten by the hard to predict distinction between two sets of semantics.

Now, it's fairly simple to force Perl to use only the Unicode semantics.
Just utf8::upgrade the string and these operators and the regex engine
will behave predictably. And that's what I told people to do. Upgrade
your strings until Perl is fixed. I've told IRC, I've told YAPC and
workshop audiences, Perl Monks, and readers of the now-official
perlunifaq. I've heard others echo the advice and I've seen lots of

Throughout the years I have always assumed that Perl would be fixed by
abandoning the ASCII-only behaviour, embracing Unicode as the default as
this had been the direction ever since 5.6. This assumption is
reflected in much of my writings and several talks on the subject.

It became painfully clear that a fix could never be made fully backward
compatible (except if the fix was enabled conditionally), but the nice
existing code in a way that will continue to work even after the bug is
later, and then wait a few years before removing the calls to
removed from Perl, it'd be trivial to make it a no-op. It is safe to
function as it does today. (Modulo only some property changes in the
Unicode spec; this causing real problems is very rare.)

That is, until 5.11.0 introduced an intentional regression. While ASCII-only
has worked well in the past, and may in specific circumstances even make
more sense in terms of performance and security, I've called going back
to it an insanely bad idea. Larry agreed, noting that it strays
from the path of gaining on Unicode and that it is poor Huffman coding.

But as I said, to me it is not just about embracing Unicode. It's also
about compatibility. I agree that this is one of the incredibly rare
occasions where it's acceptable and maybe even necessary to break
backward compatibility. Going to Unicode-only means that the breakage
before upgrading perl, to make their code forward compatible. It
provides a way to prepare for the hefty change, and many have already
gone through their codebases looking for places to add this workaround.

Going (back) to ASCII-only, however, would not provide such a clean
upgrade path. There is no way to make your 5.10 code forward compatible
with 5.11.0, regarding \[dws], because there is no way to disable
matching out-of-ASCII-range characters in Perl 5.10. So the only way to
ensure a clean upgrade is to go through your code removing all uses of
\d, \w, and \s that could possibly match Unicode characters. (Which
would be extra painful for everyone who had already gone through it to

(Granted, there is a way that forces ASCII-only semantics but it breaks
the whole flow of "receive, decode, process, encode, send": just encode
the string to UTF-8 temporarily, do your match, and decode again.)
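
The encode, match, decode detour mentioned above could look roughly like this. It is only a sketch: the sample text and variable names are made up, and it assumes the Encode module and a perl without the unicode_strings feature enabled.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "\x{0E51}\x{0E52}";   # THAI DIGIT ONE, THAI DIGIT TWO (decoded text)

# On the decoded string, \d follows Unicode semantics.
my $unicode_match = $text =~ /\A\d+\z/ ? 1 : 0;

# Temporarily go back to UTF-8 octets: every non-ASCII character becomes
# a sequence of bytes above 0x7F, so only ASCII 0-9 can satisfy \d.
my $octets      = encode('UTF-8', $text);
my $ascii_match = $octets =~ /\A\d+\z/ ? 1 : 0;

# Decode again to carry on with character semantics.
my $roundtrip = decode('UTF-8', $octets);

print "unicode=$unicode_match ascii=$ascii_match\n";
```

The extra encode and decode around a single match is exactly the break in the "receive, decode, process, encode, send" flow that the parenthetical complains about.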

So I'm glad that 5.11.1 comes with renewed sanity, and I hope that
Larry's Rule 1 invocation will prevent the ASCII-only thing from
happening again. Perl 5.12 string semantics must be the same as Perl
5.10 semantics on utf8::upgrade'd strings; everything else should only
happen if explicitly requested by the programmer, preferably as
lexically local as possible.

In the past I have suggested adding a /a flag to the regular
expression engine. (Blissfully unaware of how hard this would be to
implement.) It would be useful for those cases (mostly sysadmin work)
where you want to match only ASCII characters. It'd have to be a flag
instead of a pragma, so it survives in qr, and so it can be negated in
a subregex. I still believe that such a flag could be useful. (But I
absolutely do not insist on having it.)
--
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

Juerd Waalboer:  Perl hacker  <##...@juerd.nl>  <http://juerd.nl/sig>
Convolution:     ICT solutions and consultancy <sa...@convolution.nl>

More options Oct 29 2009, 5:43 am
Newsgroups: perl.perl5.porters
From: rvtol+use...@isolution.nl (Dr.Ruud)
Date: Thu, 29 Oct 2009 10:43:59 +0100
Local: Thurs, Oct 29 2009 5:43 am
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

Mark Mielke wrote:
> I think chief justice as you call it
> should be an elected term position

Democracy has nothing to do with this, and we must make sure that it
never will. Consensus between involved people, and pro-activity; all
else needs mostly to be ignored.

--
Ruud (not involved much, so ignore)

More options Oct 29 2009, 8:46 am
Newsgroups: perl.perl5.porters
From: leon...@leonerd.org.uk (Paul LeoNerd Evans)
Date: Thu, 29 Oct 2009 12:46:30 +0000
Local: Thurs, Oct 29 2009 8:46 am
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

On Wed, 28 Oct 2009 09:25:30 +0200

Yuval Kogman <nothingm...@woobling.org> wrote:
> How difficult would it be to introduce special chars which aren't
> charclasses, which are probably more suitable for what people want anyway
> (things that agree with grok_number, with rules for natural numbers,
> integers, decimal fractions, and floating point notation)?

> Seems like the distinction between matching a character that is a digit vs.
> matching ascii digits is mostly about what you do with the numbers
> afterwards. Perhaps it's better to just remove the extra duplication?

Vim uses foo vs \_foo to distinguish whether a linefeed is included or
not; e.g.

abc.   <= literal followed by anything except linefeed
abc\.  <= literal followed by anything including linefeed

Maybe we can find some suitable mangling to apply to \w, \d, \s, etc...
to say "with extra Unicode chars like these"

--
Paul "LeoNerd" Evans

leon...@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/

More options Oct 29 2009, 9:01 am
Newsgroups: perl.perl5.porters
From: leon...@leonerd.org.uk (Paul LeoNerd Evans)
Date: Thu, 29 Oct 2009 13:01:41 +0000
Local: Thurs, Oct 29 2009 9:01 am
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

On Wed, 28 Oct 2009 10:09:50 -0400

Ricardo Signes <perl....@rjbs.manxome.org> wrote:
> * demerphq <demer...@gmail.com> [2009-10-28T09:51:45]

> > > As for \d, though, I am horrified to think how much bad behavior could be
> > > introduced if \d started to match TITLE CASE KLINGON NUMERAL CHORGH

> > > I think it is likely that I would not upgrade to a perl5 that introduced
> > > such behavior.  "Review every regex that uses \d" is not an acceptable
> > > burden.

> > What you just described is the present situation. And many people have
> > this bug and have done exactly what you said.

> I stand both corrected and astonished.

If you're going to make \d match non-ASCII then please make this work

m/^\d+$/ and $count = $_+0;

--
Paul "LeoNerd" Evans

leon...@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/

More options Oct 29 2009, 9:21 am
Newsgroups: perl.perl5.porters
From: demer...@gmail.com (demerphq)
Date: Thu, 29 Oct 2009 14:21:02 +0100
Local: Thurs, Oct 29 2009 9:21 am
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

2009/10/29 Paul LeoNerd Evans <leon...@leonerd.org.uk>:

It already matches non-ascii.

And no it doesnt work. And frankly the idea of making it work doesnt
make a lot of sense to me.

You really want "\x{0E50}\x{0ED0}" to be another way to write "11"?
They arent even in the same script.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

More options Oct 29 2009, 10:14 am
Newsgroups: perl.perl5.porters
From: abig...@abigail.be (Abigail)
Date: Thu, 29 Oct 2009 15:14:20 +0100
Local: Thurs, Oct 29 2009 10:14 am
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

\begin{not-really-serious}

I suggest \ḋ, \ṡ, and \ẇ for the Unicode character classes, and
\d, \s, \w for the ASCII versions.

For those not able to read my suggestions, it's

\N{LATIN SMALL LETTER D WITH DOT ABOVE}   \x{1E0B}
\N{LATIN SMALL LETTER S WITH DOT ABOVE}   \x{1E61}
\N{LATIN SMALL LETTER W WITH DOT ABOVE}   \x{1E87}

\end{not-really-serious}

Abigail

More options Oct 29 2009, 11:14 am
Newsgroups: perl.perl5.porters
From: leon...@leonerd.org.uk (Paul LeoNerd Evans)
Date: Thu, 29 Oct 2009 15:14:30 +0000
Local: Thurs, Oct 29 2009 11:14 am
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

On Thu, 29 Oct 2009 14:21:02 +0100
demerphq <demer...@gmail.com> wrote:
> And no it doesnt work. And frankly the idea of making it work doesnt
> make a lot of sense to me.

> You really want "\x{0E50}\x{0ED0}" to be another way to write "11"?
> They arent even in the same script.

No, I don't. That was meant to be an appeal to absurdity to suggest
"don't do this" :)

> It already matches non-ascii.

Ah. Then that's unfortunate, as now we can't use $1 numerically after
capturing it with (\d+), and know it'll work. This is what I was getting
at..

--
Paul "LeoNerd" Evans

leon...@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/

More options Oct 29 2009, 11:18 am
Newsgroups: perl.perl5.porters
From: leon...@leonerd.org.uk (Paul LeoNerd Evans)
Date: Thu, 29 Oct 2009 15:18:14 +0000
Local: Thurs, Oct 29 2009 11:18 am
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

On Thu, 29 Oct 2009 12:46:30 +0000
Paul LeoNerd Evans <leon...@leonerd.org.uk> wrote:

Sorry; I meant

abc\_.

> Maybe we can find some suitable mangling to apply to \w, \d, \s, etc...
> to say "with extra Unicode chars like these"

--
Paul "LeoNerd" Evans

leon...@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/

More options Oct 29 2009, 12:04 pm
Newsgroups: perl.perl5.porters
From: abig...@abigail.be (Abigail)
Date: Thu, 29 Oct 2009 17:04:07 +0100
Local: Thurs, Oct 29 2009 12:04 pm
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

Well, one of the reasons for \w to match more than ASCII characters
(first with locale, later with Unicode) was that it should be possible
to process 'words' in foreign scripts as well.

If we want '\w+' to be able to match Klingon "words", why shouldn't \d+
match Klingon numbers? Yes, \d+ matches digits from different scripts,
but \w+ matches word characters from different scripts as well.

Now, don't consider this an argument in favour of having \w match non-ASCII
characters - but, IMO, if \w can match non-ASCII characters, so should \d.

Abigail

More options Oct 29 2009, 12:24 pm
Newsgroups: perl.perl5.porters
From: leon...@leonerd.org.uk (Paul LeoNerd Evans)
Date: Thu, 29 Oct 2009 16:24:53 +0000
Local: Thurs, Oct 29 2009 12:24 pm
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

On Thu, 29 Oct 2009 17:04:07 +0100

Abigail <abig...@abigail.be> wrote:
> Now, don't consider this an argument in favour of having \w match non-ASCII
> characters - but, IMO, if \w can match non-ASCII characters, so should \d.

This would seem to make the most sense, and be the most predictable.
Either all of them match Unicode, or none of them do.

If none of them do, then adding Unicode variations might be a nice idea.

I would suggest

                   word    digit   space
ASCII-only         \w      \d      \s
Includes Unicode   \Uw     \Ud     \Us

Only \U is already used. And \u.

Do we have a definitive list anywhere, on a tangential note, of the
remaining unused \x letters?

--
Paul "LeoNerd" Evans

leon...@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/

More options Oct 29 2009, 12:26 pm
Newsgroups: perl.perl5.porters
From: davidni...@gmail.com (David Nicol)
Date: Thu, 29 Oct 2009 11:26:43 -0500
Local: Thurs, Oct 29 2009 12:26 pm
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

On Thu, Oct 29, 2009 at 11:04 AM, Abigail <abig...@abigail.be> wrote:
> Now, don't consider this an argument in favour of having \w match non-ASCII
> characters - but, IMO, if \w can match non-ASCII characters, so should \d.

the constraint that anything that matches /\d+/ should numify to the
described number is a reasonable expectation. The alternative to
disallowing [^0-9] from \d is expanding numification to include
alternatives. Creating a utof function that knows the values of all
the digitty characters from all scripts would require two steps
besides compiling the list. The first step is a big discussion to
decide how to handle the edge cases, including but not limited to
expressions from mixed scripts, expressions from non-base-ten languages
(why I used cuneiform yesterday), expressions mixing scripts that mix
conventions, and ambiguous expressions.

Should attempting to numify Ⅵ0 produce six or sixty or zero and a
warning or throw an exception but only under a new pragma and if so
how should that pragma be enabled, either via strict or autodie?

Should Ⅵ be expressible as ⅤⅠ?

If the base-ten atoi algorithm is simply applied based on unicode
numeric values, for instance, Ⅵ would be six but ⅤⅠ would be
fifty-one.
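
To make the arithmetic concrete, here is a rough sketch of such a naive base-ten accumulation. The naive_atoi helper is hypothetical; it assumes the numeric field returned by Unicode::UCD's charinfo and simply gives up on anything that is not a whole number. Applied to ⅤⅠ it computes 5 * 10 + 1.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Hypothetical helper: plain base-ten accumulation over the Unicode
# numeric value of each character, with no notion of script or notation.
sub naive_atoi {
    my ($str) = @_;
    my $total = 0;
    for my $char (split //, $str) {
        my $info  = charinfo(ord $char);
        my $value = $info ? $info->{numeric} : '';
        # Give up on characters without a whole-number value (e.g. fractions).
        return unless length($value) && $value =~ /\A[0-9]+\z/;
        $total = $total * 10 + $value;
    }
    return $total;
}

my $six      = naive_atoi("\x{2165}");           # ROMAN NUMERAL SIX
my $five_one = naive_atoi("\x{2164}\x{2160}");   # ROMAN NUMERAL FIVE, ROMAN NUMERAL ONE
print "$six $five_one\n";
```

Neither result is wrong as arithmetic; the second one just has nothing to do with how Roman numerals are read, which is the edge-case discussion being asked for.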

The second step, possible to do simultaneously with the ongoing first
step (the continuing proceedings of the working group on unicode to
numeric conversions in Perl) is implementing the decisions.

More options Oct 29 2009, 12:39 pm
Newsgroups: perl.perl5.porters
From: leon...@leonerd.org.uk (Paul LeoNerd Evans)
Date: Thu, 29 Oct 2009 16:39:39 +0000
Local: Thurs, Oct 29 2009 12:39 pm
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

On Thu, 29 Oct 2009 16:24:53 +0000
Paul LeoNerd Evans <leon...@leonerd.org.uk> wrote:

> I would suggest

>                    word    digit   space
> ASCII-only         \w      \d      \s
> Includes Unicode   \Uw     \Ud     \Us

> Only \U is already used. And \u.

Actually then, on that note could we consider some more modifiers?

ASCII:   m/\w/a   m/\d/a   m/\s/a
Unicode: m/\w/u   m/\d/u   m/\s/u

and if neither is specified keep to the existing behaviour..

--
Paul "LeoNerd" Evans

leon...@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/

More options Oct 29 2009, 12:40 pm
Newsgroups: perl.perl5.porters
From: abig...@abigail.be (Abigail)
Date: Thu, 29 Oct 2009 17:40:25 +0100
Local: Thurs, Oct 29 2009 12:40 pm
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

25% is still available (13 out of 52 upper and lower case ASCII characters).

perlrebackslash.pod lists all \x letters in use. It's easy to deduce
the unused ones: \F, \i, \I, \j, \J, \m, \M, \o, \O, \q, \T, \y, \Y.

\c, \g, \k, \p, \P, \x are "partially available", that is, currently they
can only be followed by a limited set of characters, so there's some
room for expansion left.

\N is partially available in 5.10.x, but taken in blead.

Abigail

More options Oct 29 2009, 12:46 pm
Newsgroups: perl.perl5.porters
From: abig...@abigail.be (Abigail)
Date: Thu, 29 Oct 2009 17:46:55 +0100
Local: Thurs, Oct 29 2009 12:46 pm
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

On Thu, Oct 29, 2009 at 11:26:43AM -0500, David Nicol wrote:
> On Thu, Oct 29, 2009 at 11:04 AM, Abigail <abig...@abigail.be> wrote:

> > Now, don't consider this an argument in favour of having \w match non-ASCII
> > characters - but, IMO, if \w can match non-ASCII characters, so should \d.

> the constraint that anything that matches /\d+/ should numify to the
> described number is a reasonable expectation.

OTOH, not everything that numifies matches /\d+/.

>                                               The alternative to
> disallowing [^0-9] from \d is expanding numification to include
> alternatives. Creating a utof function that knows the values of all
> the digitty characters from all scripts would require two steps
> besides compiling the list. The first step is a big discussion to
> decide how to handle the edge cases, including but not limited to
> expressions from mixed scripts, expressions from non-base-ten languages
> (why I used cuneiform yesterday), expressions mixing scripts that mix
> conventions, and ambiguous expressions.

> Should attempting to numify Ⅵ0 produce six or sixty or zero and a
> warning or throw an exception but only under a new pragma and if so
> how should that pragma be enabled, either via strict or autodie?

Roman numerals are classified as "Number Letter" in the Unicode database,
and hence don't match /\d+/, which matches *digits*.

Abigail

More options Oct 29 2009, 12:52 pm
Newsgroups: perl.perl5.porters
From: abig...@abigail.be (Abigail)
Date: Thu, 29 Oct 2009 17:52:08 +0100
Local: Thurs, Oct 29 2009 12:52 pm
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

And then the 'short hand' for "[0-9]" becomes "(?a:\d)".  ;-)

/(?u-a:\d)/:  match any non-ASCII digit?

Abigail

More options Oct 29 2009, 1:01 pm
Newsgroups: perl.perl5.porters
From: pub...@khwilliamson.com (Karl Williamson)
Date: Thu, 29 Oct 2009 11:01:35 -0600
Local: Thurs, Oct 29 2009 1:01 pm
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

Non-unihan Unicode has three types of numbers.  Decimal digit, other
digit, and other numeric. \d only matches decimal digits, which is the
"right" thing in my mind.  "other digits" are like superscript 1.  I
think it is a reasonable argument to make that \d shouldn't match
anything that you can't numify automatically; I don't think it is a good
idea to have it match superscripts nor roman numerals, nor fractions.

We could extend numification to handle any or all Unicode code points
that have a numeric value.  But I don't think \d should match anything
more than decimal digits.  There is a CJK ideograph that means 10**12.
People are expecting \d to match a single digit.

More options Oct 29 2009, 2:36 pm
Newsgroups: perl.perl5.porters
From: davidni...@gmail.com (David Nicol)
Date: Thu, 29 Oct 2009 13:36:14 -0500
Local: Thurs, Oct 29 2009 2:36 pm
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]
On Thu, Oct 29, 2009 at 12:01 PM, karl williamson

> Non-unihan Unicode has three types of numbers.  Decimal digit, other digit,
> and other numeric. \d only matches decimal digits, which is the "right"
> thing in my mind.  "other digits" are like superscript 1.  I think it is a
> reasonable argument to make that \d shouldn't match anything that you can't
> numify automatically; I don't think it is a good idea to have it match
> superscripts nor roman numerals, nor fractions.

Oh. I left out the possibility of having nonsensical mixed expressions
return NaN after they warn, in case someone trying to implement

> We could extend numification to handle any or all Unicode code points that
> have a numeric value.  But I don't think \d should match anything more than
> decimal digits.  There is a CJK ideograph that means 10**12. People are
> expecting \d to match a single digit.

so if \d only means [0-9] plus various other kinds of [0-9] in
different writing systems, and doesn't include, for instance, [①-⒛],
without changing the semantics any, utoi and utof pretty much write
themselves. What's the range of characters permissible for the point
in floating point and should \. match them too, if there are more?

More options Oct 29 2009, 3:06 pm
Newsgroups: perl.perl5.porters
From: pub...@khwilliamson.com (Karl Williamson)
Date: Thu, 29 Oct 2009 13:06:46 -0600
Local: Thurs, Oct 29 2009 3:06 pm
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

The decimal point is locale dependent and not specified in the Unicode
standard, but they have a CLDR (Common Locale Data Repository) project
that I believe contains that info.

More options Oct 29 2009, 3:11 pm
Newsgroups: perl.perl5.porters
From: tchr...@perl.com (Tom Christiansen)
Date: Thu, 29 Oct 2009 13:11:46 -0600
Local: Thurs, Oct 29 2009 3:11 pm
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

Karl wrote:
> Non-unihan Unicode has three types of numbers.  Decimal digit, other
> digit, and other numeric. \d only matches decimal digits, which is the
> "right" thing in my mind.  "other digits" are like superscript 1.  I
> think it is a reasonable argument to make that \d shouldn't match
> anything that you can't numify automatically; I don't think it is a good
> idea to have it match superscripts nor roman numerals, nor fractions.
> We could extend numification to handle any or all Unicode code points
> that have a numeric value.  But I don't think \d should match anything
> more than decimal digits.  There is a CJK ideograph that means 10**12.
> People are expecting \d to match a single digit.

There may be issues even with just that.  What's a "digit",
really?  Some code points that by name call themselves DIGITs
are \d, but some calling themselves DIGITs aren't--they're \D,
like all the counting-rod digits:

\D U+1d360 COUNTING ROD UNIT DIGIT ONE
\D U+1d369 COUNTING ROD TENS DIGIT ONE

True, those seem no great loss to ignore, but I'm not so sure
all others are.  Some scripts' DIGITs are actually mixed \d
and \D, like:

\d U+00f21 TIBETAN DIGIT ONE
\D U+00f2a TIBETAN DIGIT HALF ONE

\d U+00c67 TELUGU DIGIT ONE
\D U+00c79 TELUGU FRACTION DIGIT ONE FOR ODD POWERS OF FOUR

I imagine, but do not know, that numbers in those scripts might
be composed of a mixture of \d and \D DIGITs.

Ethiopic DIGITs are *all* \D, and read left to right:

\D L   U+01369 ETHIOPIC DIGIT ONE

Whereas all Kharoshthi DIGITs are also \D, but read right to left:

\D R   U+10a40 KHAROSHTHI DIGIT ONE

You'd think it'd be less confusing just within say, European
Numbers, but it isn't that obviously so.  Our "1", the Arabic
numeral digit one, is of \p{BidiClass:EN}, a European Number
not an Arabic one:

\d EN  U+00031 DIGIT ONE

While Arabic-Indic's digit one is instead \p{BidiClass:AN}, an
Arabic numeral that indeed counts as an Arabic Number:

\d AN  U+00661 ARABIC-INDIC DIGIT ONE

Yet the *extended* Arabic-Indic's digit one has swapped back to
being a European Number again:

\d EN  U+006f1 EXTENDED ARABIC-INDIC DIGIT ONE

If automagic atod()ish nummification [hmm: or numefication?*]
for digit-strings is the goal, I don't know how to handle digit
strings composed of digits from various scripts.  Besides the \d
\D issues above, even when restricted to \d digits, directional
concerns remain.

Most go left to right:

\d L   U+009e7 BENGALI DIGIT ONE
\d L   U+00e51 THAI DIGIT ONE
\d L   U+00f21 TIBETAN DIGIT ONE

But Nko digits, which are real \d digits, are written right to left:

\d R   U+007c1 NKO DIGIT ONE

I think it's probably prudent to avoid most or all non-\d
numbers, like subscripts and superscripts, even when those
count as European Numbers:

\D EN  U+000b9 SUPERSCRIPT ONE
\D EN  U+02081 SUBSCRIPT ONE
\D EN  U+02488 DIGIT ONE FULL STOP
\D ON  U+02460 CIRCLED DIGIT ONE
\D ON  U+0278a DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ONE

If somebody is asked to enter the year like for copyrights, then
2009 of course works perfectly fine, and so do some other scripts.
It seems too specific to the problem domain, not to mention just
plain dangerous, to try to cope with Roman numerals, whether
entered as MMIX or in the Unicode versions.

Other writing systems than Roman/Latin also use their letters
as numeric digits.  If the current Hebrew year is 2009 + 3760
= 5769, they'd then write that--from right to left--as 769,
dropping the 5000 by convention, this way:

\D R   U+005ea HEBREW LETTER TAV (number 400)
\D R   U+005e9 HEBREW LETTER SHIN (number 300)
\D R   U+005e1 HEBREW LETTER SAMEKH (number 60)
\D R   U+005d8 HEBREW LETTER TET (number 9)

Actually, order doesn't really matter: it's not positional, just
accumulative.  Since we'd boot the Romans, I guess we'd boot the
Hebrews, too; don't see any way around it, really.

Just a few conundra I've been tossing around.

--tom
--

*  numinal: divine.

numinous: of or pertaining to a numen; divine, spiritual,
revealing or suggesting the presence of a god; inspiring
awe and reverence.  Hence numinosity, numinousness, numinously.

numify: to apotheosize.

More options Oct 29 2009, 3:18 pm
Newsgroups: perl.perl5.porters
From: m...@mark.mielke.cc (Mark Mielke)
Date: Thu, 29 Oct 2009 15:18:03 -0400
Local: Thurs, Oct 29 2009 3:18 pm
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]
On 10/28/2009 06:50 AM, Juerd Waalboer wrote:

Lots of good points about the breakage of utf-8 vs non-utf-8. This
should be fixed, but it's not really related to concerns I have about
\d. I think you and I agree that users should not need to know whether
a string is utf-8 or non-utf-8. All Perl operations should, if at all
possible, operate the same, no matter what internal form it happens to
use to encode the string. It is a leaky abstraction for every exception
to this rule. It's bad, and it has prevented Perl from successfully
reaching the state of "embracing Unicode."

My concerns for \d are well covered by other people, even in posts today.
I have LOTS of code that uses \d written over the last 20 years, and it
is very concerning to me that this code which uses \d as a guard against
invalid input, may be accepting Unicode characters which cause my
programs to break, cause my programs to be exploitable over the network,
or cause data corruption.

I have a lot of code that does something like:

$number = /\A\d+\z/ ? 0 + $_ : die "...";

It scares me that this code may now be broken, or may become broken.

Was I wrong to use \d+? What should I have used? I've been taught to
avoid [0-9] in all languages since before I learned Perl, due to silly
character sets like EBCDIC. Now it seems like my only choice is:

$number = /\A[0123456789]+\z/ ? 0 + $_ : die "...";

To me, that's just insanity.

I don't think \d should match anything that would allow /\A\d+\z/ to
result in a value where ("$_" != 0+$_). Go ahead and make 0+$_ better,
or go ahead and make \d match only ASCII '0' through '9' - but anything
that causes this "identity" break is a BAD decision. Heck, even go ahead
and make a BAD decision - but the result will be that I recommend
against using Perl. I trust my programming language to do what I tell it
to. If it starts doing stupid things with each future release - I will
stop trusting it, and I will stop using it.
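
A small sketch of the "identity" break in question, using a made-up input of Thai digits. It assumes current numification, which only understands ASCII digits: the string passes the \d guard but converts to zero, while an explicit [0-9] guard rejects it.

```perl
use strict;
use warnings;

my $input = "\x{0E51}\x{0E52}\x{0E53}";   # THAI DIGIT ONE, TWO, THREE

my $passes_d_guard     = $input =~ /\A\d+\z/    ? 1 : 0;  # Unicode-aware \d
my $passes_ascii_guard = $input =~ /\A[0-9]+\z/ ? 1 : 0;  # explicit ASCII range

my $number = do {
    no warnings 'numeric';   # numifying this string would otherwise warn
    0 + $input;
};

print "\\d guard: $passes_d_guard, [0-9] guard: $passes_ascii_guard, value: $number\n";
```

So a guard written with \d can let through input that the following 0+$_ silently turns into 0, which is the failure mode described above.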

Cheers,
mark

--
Mark Mielke<m...@mielke.cc>

More options Oct 29 2009, 3:48 pm
Newsgroups: perl.perl5.porters
From: pub...@khwilliamson.com (Karl Williamson)
Date: Thu, 29 Oct 2009 13:48:42 -0600
Local: Thurs, Oct 29 2009 3:48 pm
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

Here is a reference about Unicode security issues:
http://unicode.org/reports/tr36/

And here is an excerpt from that:

Turning away from the focus on domain names for a moment, there is
another area where visual spoofs can be used. Many scripts have sets of
decimal digits that are different in shape from the typical European
digits {0 1 2 3 4 5 6 7 8 9}. For example, Bengali has {০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯}, while
Oriya has {୦ ୧ ୨ ୩  ୪ ୫ ୬  ୭ ୮  ୯}. While the sets taken as a whole are
different in shape, individual digits may have the same shapes as digits
from other scripts, even digits of different values. For example, the
string  ৪୨ is visually confusable with 89 (at small sizes), but actually
has the numeric value 42. Where software interprets the numeric value of
a string of digits without detecting that the digits are from different
scripts, it is possible to generate such spoofs.
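
As an illustration of the check the excerpt calls for, here is a rough sketch that refuses digit strings whose digits come from different scripts. The same_script_number helper is hypothetical; it relies on the decimal field of Unicode::UCD's charinfo and on the assumption that each script's decimal digits occupy ten consecutive code points starting at that script's zero.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Hypothetical helper: returns the value of a decimal digit string, or
# undef if any character is not a decimal digit or if the digits do not
# all belong to the same run of ten (that is, the same script).
sub same_script_number {
    my ($str) = @_;
    return unless length $str;
    my ($zero, $total) = (undef, 0);
    for my $char (split //, $str) {
        my $info  = charinfo(ord $char) or return;
        my $digit = $info->{decimal};
        return unless defined($digit) && length($digit);  # not a decimal digit
        my $this_zero = ord($char) - $digit;   # code point of this run's zero
        $zero = $this_zero unless defined $zero;
        return if $this_zero != $zero;         # digits from two different runs
        $total = $total * 10 + $digit;
    }
    return $total;
}

my $bengali = same_script_number("\x{09EA}\x{09E8}");  # BENGALI FOUR, BENGALI TWO
my $mixed   = same_script_number("\x{09EA}\x{0B68}");  # BENGALI FOUR, ORIYA TWO
print defined $bengali ? "bengali: $bengali\n" : "bengali: rejected\n";
print defined $mixed   ? "mixed: $mixed\n"     : "mixed: rejected\n";
```

The mixed string is the ৪୨ example from the excerpt: each digit has a perfectly good value on its own, and only the combination is suspicious.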

More options Oct 29 2009, 4:11 pm
Newsgroups: perl.perl5.porters
From: john.im...@vodafoneemail.co.uk (John)
Date: Thu, 29 Oct 2009 20:11:13 +0000
Local: Thurs, Oct 29 2009 4:11 pm
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]
Abigail wrote:
> If we want '\w+' to be able to match Klingon "words", why shouldn't \d+
> match Klingon numbers? Yes, \d+ matches digits from different scripts,
> but \w+ matches word characters from different scripts as well.

> Now, don't consider this an argument in favour of having \w match non-ASCII
> characters - but, IMO, if \w can match non-ASCII characters, so should \d.

> Abigail

The Unicode character database file UnicodeData.txt contains in fields
6, 7 and 8 the value of numeric characters. Could we not use that to
numify characters such as Ⅲ and Ⅸ, so that Ⅸ - Ⅳ == 5?

John

More options Oct 29 2009, 4:19 pm
Newsgroups: perl.perl5.porters
From: john.im...@vodafoneemail.co.uk (John)
Date: Thu, 29 Oct 2009 20:19:42 +0000
Local: Thurs, Oct 29 2009 4:19 pm
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

Abigail wrote:
> Roman numerals are classified as "Number Letter" in the Unicode database,
> and hence don't match /\d+/, which matches *digits*.

> Abigail

But Character.getNumericValue('Ⅳ') == 4 in Java, so there is a
numeric mapping for the Roman numerals.

John

More options Oct 29 2009, 4:25 pm
Newsgroups: perl.perl5.porters
From: john.im...@vodafoneemail.co.uk (John)
Date: Thu, 29 Oct 2009 20:25:17 +0000
Local: Thurs, Oct 29 2009 4:25 pm
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]
karl williamson wrote:
> The decimal point is locale dependent and not specified in the Unicode
> standard, but they have a CLDR (Common Locale Data Repository) project
> that I believe contains that info.

I'm currently trying to create a Perl library that will handle CLDR data;
see http://github.com/ThePilgrim/perlcldr for more.

John

More options Oct 29 2009, 4:34 pm
Newsgroups: perl.perl5.porters
From: pub...@khwilliamson.com (Karl Williamson)
Date: Thu, 29 Oct 2009 14:34:58 -0600
Local: Thurs, Oct 29 2009 4:34 pm
Subject: Re: Rule 1 has been invoked [Re: What should \s \w \d match in 5.12?]

If we decide to continue to allow \d to match non-ASCII, and I'm not
advocating that, it should only match decimal digits, regardless of the
names of the characters.

>     \d U+00f21 TIBETAN DIGIT ONE
>     \D U+00f2a TIBETAN DIGIT HALF ONE

Here is an example of a poor choice of name, or at least one that is
misunderstood by people.  The term DIGIT in Unicode means only that it
is a single character that has a numeric meaning, much like we refer to
the ASCII F as a hexadecimal digit.  So a DIGIT in Unicode doesn't have
to mean 0 through 9.

In this particular case, the HALF ONE means that this is 1 - .5 = .5,
which is the numeric value of the character.  So it is a digit that
means a non-integral number.  I think it should have been named ONE
MINUS HALF.  There is a HALF ZERO whose value is -.5.  I got curious a
while back as to why Tibetan of all languages in the world would have a
single character encoding the concept of -.5, so I looked it up on the
internet.  IIRC, The claim was that all these half digits are based on a
single Tibetan postage stamp of one of the values, and that the others
were inferred artificially based on Tibetan grammatical rules, and there
is no concrete evidence that they really ever existed except for one of
them on that one stamp.  There's a picture of it on the internet.

>     \d U+00c67 TELUGU DIGIT ONE
>     \D U+00c79 TELUGU FRACTION DIGIT ONE FOR ODD POWERS OF FOUR

> I imagine, but do not know, that numbers in those scripts might
> be composed of a mixture of \d and \D DIGITs.

U+0C79 also has numeric value one, but it appears to be more like a
remainder after taking a number mod 4, so I doubt that it stands on its own.

It seems we may be presuming a base 10 system inappropriately at times.

These classifications of European or Arabic number are solely for
implementing the Unicode Bidirectional Algorithm, and don't mean
anything beyond that.  Again, using the decimal digit type would be the
way to go, if we continue to go there.

> If automagic atod()ish nummification [hmm: or numefication?*]
> for digit-strings is the goal, I don't know how to handle digit
> strings composed of digits from various scripts.  Besides the \d
> \D issues above, even when restricted to \d digits, directional
> concerns remain.

> Most go left to right:

>     \d L   U+009e7 BENGALI DIGIT ONE
>     \d L   U+00e51 THAI DIGIT ONE
>     \d L   U+00f21 TIBETAN DIGIT ONE

> But Nko digits, which are real \d digits, are written right to left:

>     \d R   U+007c1 NKO DIGIT ONE

These are some of the reasons I'm leery of allowing \d to match outside
ASCII by default.  I don't get the symmetry argument of Abigail's.  A
number of people have responded recently about how they're surprised at
how it really works now.  Much code has been written that assumes that
\d is [0-9], and that digits are part of a base 10 number written
left-to-right.  I'm not sure right now if there is a way for a program
that doesn't use Encode or the command line options or I/O layers to get
Unicode data unexpectedly.  But even if they are, Unicode is a large
beast, and someone may be getting more than they bargained for.