Yesterday I read in perl5110delta that \s, \w and \d would change to ascii-only. I thought this was a bad idea for several distinct reasons, and briefly discussed the change on IRC and Twitter, but not perl5-porters, primarily because a discussion on this list often goes on for a while. (To convince a group of highly knowledgeable people is incredibly energy consuming; I'm not doing that anymore.)
Today I found perl5111delta; somehow I had failed to notice that the changes were already reverted.
By the way, I asked in #perl6 about the direction for Perl 6, not knowing that Larry would feel strongly about it, and not knowing he'd invoke Rule 1. However, I'm glad that he did.
Now with "Rule 1" invoked and the changes already reverted, I feel confident enough that I can post my thoughts here without the pressure of having to convince anyone, or thinking I should.
For me, it's not just about embracing Unicode.
Ignoring Perl 5.11.0, there's a clear bug in Unicode-capable Perl 5s regarding string semantics. It took a while to reach consensus that this is indeed a design bug, a broken abstraction: the semantics of several operations are dependent on the internal representation of otherwise indistinguishable values.
Specifically, several operations take only codepoints in the ASCII range into account when the internal encoding of the operand string is not UTF-8. This applies to built-in character classes and their shortcuts, and to functionality that deals with letter case (uc, lc, /i).
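A minimal sketch of the broken abstraction, assuming a perl where the unicode_strings feature is not enabled (so no `use v5.12` or later in scope). The variable names are only for illustration. Two strings compare equal, yet uc treats them differently depending on how they are stored:

```perl
use strict;
use warnings;

# Two copies of the same one-character string, U+00E9 (e with acute).
my $native   = "\xE9";
my $upgraded = "\xE9";
utf8::upgrade($upgraded);    # same characters, now stored as UTF-8 internally

# The values are indistinguishable to eq ...
my $same_value = $native eq $upgraded ? 1 : 0;

# ... but uc uses ASCII-only rules for one and Unicode rules for the other.
my $same_uc = uc($native) eq uc($upgraded) ? 1 : 0;

print "equal: $same_value, equal after uc: $same_uc\n";
```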
The historical behavior was to match only ASCII, but Unicode support was added later. Most people remained ignorant about this change and its implications. But anyone who is aware of the added Unicode support is bitten by the hard to predict distinction between two sets of semantics.
Now, it's fairly simple to force Perl to use only the Unicode semantics. Just utf8::upgrade the string and these operators and the regex engine will behave predictably. And that's what I told people to do. Upgrade your strings until Perl is fixed. I've told IRC, I've told YAPC and workshop audiences, Perl Monks, and readers of the now-official perlunifaq. I've heard others echo the advice and I've seen lots of utf8::upgrade's in the wild.
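As a rough illustration of the workaround (a sketch, again assuming default semantics without the unicode_strings feature; `is_word_char` is a hypothetical helper, not something from perlunifaq):

```perl
use strict;
use warnings;

# Hypothetical helper: does the string consist of exactly one word character?
sub is_word_char {
    my ($str) = @_;
    return $str =~ /\A\w\z/ ? 1 : 0;
}

my $str = "\xE9";                   # U+00E9, stored as a single byte
my $before = is_word_char($str);    # ASCII-only semantics apply here
utf8::upgrade($str);                # changes only the internal representation
my $after = is_word_char($str);     # Unicode semantics apply here

print "before upgrade: $before, after upgrade: $after\n";
```

The string's value never changes; only the call to utf8::upgrade decides which set of semantics the regex engine picks.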
Throughout the years I have always assumed that Perl would be fixed by abandoning the ASCII-only behaviour, embracing Unicode as the default as this had been the direction ever since 5.6. This assumption is reflected in much of my writings and several talks on the subject.
It became painfully clear that a fix could never be made fully backward compatible (except if the fix was enabled conditionally), but the nice thing about the utf8::upgrade workaround is that you can repair your existing code in a way that will continue to work even after the bug is fixed. You can add the workaround to your code now, upgrade to 5.12 later, and then wait a few years before removing the calls to utf8::upgrade or just leave them there. Even if utf8::upgrade were ever removed from Perl, it'd be trivial to make it a no-op. It is safe to add utf8::upgrade and be fairly certain that your code will continue to function as it does today. (Modulo only some property changes in the Unicode spec; this causing real problems is very rare.)
That is, until 5.11.0 introduced an intentional regression. While ASCII-only has worked well in the past, and may in specific circumstances even make more sense in terms of performance and security, I've called going back to it an insanely bad idea. Larry agreed, noting that it strays from the path of gaining on Unicode and that it is poor Huffman coding.
But as I said, to me it is not just about embracing Unicode. It's also about compatibility. I agree that this is one of the incredibly rare occasions where it's acceptable and maybe even necessary to break backward compatibility. Going to Unicode-only means that the breakage can be controlled by programmers. They can add utf8::upgrade statements before upgrading perl, to make their code forward compatible. It provides a way to prepare for the hefty change, and many have already gone through their codebases looking for places to add this workaround.
Going (back) to ASCII-only, however, would not provide such a clean upgrade path. There is no way to make your 5.10 code forward compatible with 5.11.0, regarding \[dws], because there is no way to disable matching out-of-ASCII-range characters in Perl 5.10. So the only way to ensure a clean upgrade is to go through your code removing all uses of \d, \w, and \s that could possibly match Unicode characters. (Which would be extra painful for everyone who had already gone through it to add utf8::upgrade calls!)
(Granted, there is a way that forces ASCII-only semantics, but it breaks the whole flow of "receive, decode, process, encode, send": just encode the string to UTF-8 temporarily, do your match, and decode again.)
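A sketch of that detour, using the Encode module; the variable names are made up for the example, and it assumes default regex semantics (no /u, no unicode_strings):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "\x{0661}";    # ARABIC-INDIC DIGIT ONE, a decoded character string

my $unicode_match = $text =~ /\d/ ? 1 : 0;    # matches as a Unicode digit

my $bytes = encode('UTF-8', $text);           # temporarily back to bytes
my $ascii_match = $bytes =~ /\d/ ? 1 : 0;     # only ASCII digits can match now
my $roundtrip = decode('UTF-8', $bytes);      # decode again to carry on

print "character string: $unicode_match, encoded bytes: $ascii_match\n";
```

It works, but every match now needs an encode and decode around it, which is exactly the flow breakage complained about above.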
So I'm glad that 5.11.1 comes with renewed sanity, and I hope that Larry's Rule 1 invocation will prevent the ASCII-only thing from happening again. Perl 5.12 string semantics must be the same as Perl 5.10 semantics on utf8::upgrade'd strings; everything else should only happen if explicitly requested by the programmer, preferably as lexically local as possible.
In the past I have suggested adding a /a flag to the regular expression engine. (Blissfully unaware of how hard this would be to implement.) It would be useful for those cases (mostly sysadmin work) where you want to match only ASCII characters. It'd have to be a flag instead of a pragma, so it survives in qr, and so it can be negated in a subregex. I still believe that such a flag could be useful. (But I absolutely do not insist on having it.)
--
Met vriendelijke groet, Kind regards, Korajn salutojn,
Mark Mielke wrote:
> I think chief justice as you call it should be an elected term position
Democracy has nothing to do with this, and we must make sure that it never will. Consensus between involved people, and pro-activity; all else needs mostly to be ignored.
Yuval Kogman <nothingm...@woobling.org> wrote:
> How difficult would it be to introduce special chars which aren't charclasses, which are probably more suitable for what people want anyway (things that agree with grok_number, with rules for natural numbers, integers, decimal fractions, and floating point notation)?
> Seems like the distinction between matching a character that is a digit vs. matching ascii digits is mostly about what you do with the numbers afterwards. Perhaps it's better to just remove the extra duplication?
Vim uses foo vs \_foo to distinguish whether a linefeed is included or not; e.g.
abc. <= literal followed by anything except linefeed
abc\. <= literal followed by anything including linefeed
Maybe we can find some suitable mangling to apply to \w, \d, \s, etc... to say "with extra Unicode chars like these"
> > > As for \d, though, I am horrified to think how much bad behavior could be introduced if \d started to match TITLE CASE KLINGON NUMERAL CHORGH
> > > I think it is likely that I would not upgrade to a perl5 that introduced such behavior. "Review every regex that uses \d" is not an acceptable burden.
> > What you just described is the present situation. And many people have this bug and have done exactly what you said.
> I stand both corrected and astonished.
If you're going to make \d match non-ASCII then please make this work
m/^\d+$/ and $count = $_+0;
>> > > As for \d, though, I am horrified to think how much bad behavior could be introduced if \d started to match TITLE CASE KLINGON NUMERAL CHORGH
>> > > I think it is likely that I would not upgrade to a perl5 that introduced such behavior. "Review every regex that uses \d" is not an acceptable burden.
>> > What you just described is the present situation. And many people have this bug and have done exactly what you said.
>> I stand both corrected and astonished.
> If you're going to make \d match non-ASCII then please make this work
> m/^\d+$/ and $count = $_+0;
It already matches non-ASCII.
And no, it doesn't work. And frankly the idea of making it work doesn't make a lot of sense to me.
You really want "\x{0E51}\x{0ED1}" to be another way to write "11"?
They aren't even in the same script.
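To make the gap concrete, a small sketch (variable names are illustrative only) using THAI DIGIT ONE followed by LAO DIGIT ONE. The pattern accepts the string, but Perl's numification only understands ASCII digits:

```perl
use strict;
use warnings;

my $mixed = "\x{0E51}\x{0ED1}";    # THAI DIGIT ONE, LAO DIGIT ONE

my $matches = $mixed =~ /^\d+$/ ? 1 : 0;

my $count;
{
    no warnings 'numeric';    # numifying this string would warn otherwise
    $count = $mixed + 0;
}

print "matches: $matches, count: $count\n";
```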
On Thu, Oct 29, 2009 at 12:46:30PM +0000, Paul LeoNerd Evans wrote:
> On Wed, 28 Oct 2009 09:25:30 +0200 Yuval Kogman <nothingm...@woobling.org> wrote:
> > How difficult would it be to introduce special chars which aren't charclasses, which are probably more suitable for what people want anyway (things that agree with grok_number, with rules for natural numbers, integers, decimal fractions, and floating point notation)?
> > Seems like the distinction between matching a character that is a digit vs. matching ascii digits is mostly about what you do with the numbers afterwards. Perhaps it's better to just remove the extra duplication?
> Vim uses foo vs \_foo to distinguish whether a linefeed is included or not; e.g.
> abc. <= literal followed by anything except linefeed
> abc\. <= literal followed by anything including linefeed
> Maybe we can find some suitable mangling to apply to \w, \d, \s, etc... to say "with extra Unicode chars like these"
\begin{not-really-serious}
I suggest \ḋ, \ṡ, and \ẇ for the Unicode character classes, and \d, \s, \w for the ASCII versions.
For those not able to read my suggestions, it's
\N{LATIN SMALL LETTER D WITH DOT ABOVE} \x{1E0B}
\N{LATIN SMALL LETTER S WITH DOT ABOVE} \x{1E61}
\N{LATIN SMALL LETTER W WITH DOT ABOVE} \x{1E87}
> On Wed, 28 Oct 2009 09:25:30 +0200 Yuval Kogman <nothingm...@woobling.org> wrote:
> > How difficult would it be to introduce special chars which aren't charclasses, which are probably more suitable for what people want anyway (things that agree with grok_number, with rules for natural numbers, integers, decimal fractions, and floating point notation)?
> > Seems like the distinction between matching a character that is a digit vs. matching ascii digits is mostly about what you do with the numbers afterwards. Perhaps it's better to just remove the extra duplication?
> Vim uses foo vs \_foo to distinguish whether a linefeed is included or not; e.g.
> abc. <= literal followed by anything except linefeed
> abc\. <= literal followed by anything including linefeed
Sorry; I meant
abc\_.
> Maybe we can find some suitable mangling to apply to \w, \d, \s, etc... to say "with extra Unicode chars like these"
> >> > > As for \d, though, I am horrified to think how much bad behavior could be introduced if \d started to match TITLE CASE KLINGON NUMERAL CHORGH
> >> > > I think it is likely that I would not upgrade to a perl5 that introduced such behavior. "Review every regex that uses \d" is not an acceptable burden.
> >> > What you just described is the present situation. And many people have this bug and have done exactly what you said.
> >> I stand both corrected and astonished.
> > If you're going to make \d match non-ASCII then please make this work
> > m/^\d+$/ and $count = $_+0;
> It already matches non-ASCII.
> And no, it doesn't work. And frankly the idea of making it work doesn't make a lot of sense to me.
> You really want "\x{0E51}\x{0ED1}" to be another way to write "11"?
> They aren't even in the same script.
Well, one of the reasons for \w to match more than ASCII characters (first with locale, later with Unicode) was that it should be possible to process 'words' in foreign scripts as well.
If we want '\w+' to be able to match Klingon "words", why shouldn't \d+ match Klingon numbers? Yes, \d+ matches digits from different scripts, but \w+ matches word characters from different scripts as well.
Now, don't consider this an argument in favour of having \w match non-ASCII characters - but, IMO, if \w can match non-ASCII characters, so should \d.
Abigail <abig...@abigail.be> wrote:
> Now, don't consider this an argument in favour of having \w match non-ASCII characters - but, IMO, if \w can match non-ASCII characters, so should \d.
This would seem to make the most sense, and be the most predictable. Either all of them match Unicode, or none of them do.
If none of them do, then adding Unicode variations might be a nice idea.
I would suggest
                  word  digit  space
ASCII-only        \w    \d     \s
Includes Unicode  \Uw   \Ud    \Us
Only \U is already used. And \u.
Do we have a definitive list anywhere, on a tangential note, of the remaining unused \x letters?
On Thu, Oct 29, 2009 at 11:04 AM, Abigail <abig...@abigail.be> wrote:
> Now, don't consider this an argument in favour of having \w match non-ASCII characters - but, IMO, if \w can match non-ASCII characters, so should \d.
The constraint that anything that matches /\d+/ should numify to the described number is a reasonable expectation. The alternative to disallowing [^0-9] from \d is expanding numification to include alternatives. Creating a utof function that knows the values of all the digitty characters from all scripts would require two steps besides compiling the list. The first step is a big discussion to decide how to handle the edge cases, including but not limited to expressions from mixed scripts, expressions from non-base-ten languages (why I used cuneiform yesterday), expressions mixing scripts that mix conventions, and ambiguous expressions.
Should attempting to numify Ⅵ0 produce six or sixty or zero and a warning or throw an exception but only under a new pragma and if so how should that pragma be enabled, either via strict or autodie?
Should Ⅵ be expressible as ⅤⅠ?
If the base-ten atoi algorithm is simply applied based on Unicode numeric values, for instance, Ⅵ would be six but ⅤⅠ would be fifty-one (5 * 10 + 1).
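A sketch of such a naive conversion, to show where it goes wrong. `naive_atoi` is a hypothetical function, not an existing API; it looks up each character's numeric value with charinfo from Unicode::UCD and folds the values in base ten:

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Hypothetical naive converter: treat every character as one base-ten
# "digit" whose value is the character's Unicode numeric value.
sub naive_atoi {
    my ($str) = @_;
    my $total = 0;
    for my $char (split //, $str) {
        my $info  = charinfo(ord $char);
        my $value = $info ? $info->{numeric} : '';
        # Give up on characters without a whole-number value.
        return unless defined $value && $value =~ /\A[0-9]+\z/;
        $total = $total * 10 + $value;
    }
    return $total;
}

my $six       = naive_atoi("\x{2165}");            # ROMAN NUMERAL SIX
my $five_one  = naive_atoi("\x{2164}\x{2160}");    # ROMAN NUMERAL FIVE, ROMAN NUMERAL ONE
my $six_zero  = naive_atoi("\x{2165}0");           # ROMAN NUMERAL SIX, DIGIT ZERO

print "$six $five_one $six_zero\n";
```

Nothing here knows that Roman numerals are additive rather than positional, which is the kind of edge case the discussion would have to settle.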
The second step, possible to do simultaneously with the ongoing first step (the continuing proceedings of the working group on unicode to numeric conversions in Perl) is implementing the decisions.
On Thu, Oct 29, 2009 at 04:24:53PM +0000, Paul LeoNerd Evans wrote:
> On Thu, 29 Oct 2009 17:04:07 +0100 Abigail <abig...@abigail.be> wrote:
> > Now, don't consider this an argument in favour of having \w match non-ASCII characters - but, IMO, if \w can match non-ASCII characters, so should \d.
> This would seem to make the most sense, and be the most predictable. Either all of them match Unicode, or none of them do.
> If none of them do, then adding Unicode variations might be a nice idea.
> I would suggest
>                   word  digit  space
> ASCII-only        \w    \d     \s
> Includes Unicode  \Uw   \Ud    \Us
> Only \U is already used. And \u.
> Do we have a definitive list anywhere, on a tangential note, of the remaining unused \x letters?
25% is still available (13 out of 52 upper and lower case ASCII characters).
perlrebackslash.pod lists all \x letters in use. It's easy to deduce the unused ones: \F, \i, \I, \j, \J, \m, \M, \o, \O, \q, \T, \y, \Y.
\c, \g, \k, \p, \P, \x are "partially available", that is, currently they can only be followed by a limited set of characters, so there's some room for expansion left.
\N is partially available in 5.10.x, but taken in blead.
On Thu, Oct 29, 2009 at 11:26:43AM -0500, David Nicol wrote:
> On Thu, Oct 29, 2009 at 11:04 AM, Abigail <abig...@abigail.be> wrote:
> > Now, don't consider this an argument in favour of having \w match non-ASCII characters - but, IMO, if \w can match non-ASCII characters, so should \d.
> the constraint that anything that matches /\d+/ should numify to the described number is a reasonable expectation.
OTOH, not everything that numifies matches /\d+/.
> The alternative to disallowing [^0-9] from \d is expanding numification to include alternatives. Creating a utof function that knows the values of all the digitty characters from all scripts would require two steps besides compiling the list. The first step is a big discussion to decide how to handle the edge cases, including but not limited to expressions from mixed scripts, expressions from non-base-ten languages (why I used cuneiform yesterday), expressions mixing scripts that mix conventions, and ambiguous expressions.
> Should attempting to numify Ⅵ0 produce six or sixty or zero and a warning or throw an exception but only under a new pragma and if so how should that pragma be enabled, either via strict or autodie?
Roman numerals are classified as "Number Letter" in the Unicode database, and hence don't match /\d+/, which matches *digits*.
>> and if neither is specified keep to the existing behaviour..
> I was just having a similar thought. Although I do not think we would need both as it should be one or the other. With Unicode as the default we would only need /a. With this qr// patterns would also carry it with them
And then the 'short hand' for "[0-9]" becomes "(?a:\d)". ;-)
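For what it's worth, perls from 5.14 onward do have an /a modifier along these lines. A short sketch of how it behaves there (this needs 5.14 or later; the variable names are just for the example):

```perl
use strict;
use warnings;
# Assumes perl 5.14 or later, where the /a modifier exists.

my $arabic_one = "\x{0661}";    # ARABIC-INDIC DIGIT ONE

my $default_match = $arabic_one =~ /\A\d\z/      ? 1 : 0;    # Unicode digit
my $flag_match    = $arabic_one =~ /\A\d\z/a     ? 1 : 0;    # ASCII only
my $inline_match  = $arabic_one =~ /\A(?a:\d)\z/ ? 1 : 0;    # ASCII only, scoped

# The flag travels with a compiled qr// object when it is interpolated.
my $ascii_digit   = qr/\d/a;
my $embedded      = $arabic_one =~ /\A${ascii_digit}\z/ ? 1 : 0;
my $ascii_ok      = "1"         =~ /\A${ascii_digit}\z/ ? 1 : 0;

print "$default_match $flag_match $inline_match $embedded $ascii_ok\n";
```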
David Nicol wrote:
> On Thu, Oct 29, 2009 at 11:04 AM, Abigail <abig...@abigail.be> wrote:
>> Now, don't consider this an argument in favour of having \w match non-ASCII characters - but, IMO, if \w can match non-ASCII characters, so should \d.
> the constraint that anything that matches /\d+/ should numify to the described number is a reasonable expectation. The alternative to disallowing [^0-9] from \d is expanding numification to include alternatives. Creating a utof function that knows the values of all the digitty characters from all scripts would require two steps besides compiling the list. The first step is a big discussion to decide how to handle the edge cases, including but not limited to expressions from mixed scripts, expressions from non-base-ten languages (why I used cuneiform yesterday), expressions mixing scripts that mix conventions, and ambiguous expressions.
> Should attempting to numify Ⅵ0 produce six or sixty or zero and a warning or throw an exception but only under a new pragma and if so how should that pragma be enabled, either via strict or autodie?
> Should Ⅵ be expressible as ⅤⅠ?
> If the base-ten atoi algorithm is simply applied based on Unicode numeric values, for instance, Ⅵ would be six but ⅤⅠ would be fifty-one (5 * 10 + 1).
> The second step, possible to do simultaneously with the ongoing first step (the continuing proceedings of the working group on unicode to numeric conversions in Perl) is implementing the decisions.
Non-unihan Unicode has three types of numbers. Decimal digit, other digit, and other numeric. \d only matches decimal digits, which is the "right" thing in my mind. "other digits" are like superscript 1. I think it is a reasonable argument to make that \d shouldn't match anything that you can't numify automatically; I don't think it is a good idea to have it match superscripts nor roman numerals, nor fractions.
We could extend numification to handle any or all Unicode code points that have a numeric value. But I don't think \d should match anything more than decimal digits. There is a CJK ideograph that means 10**12. People are expecting \d to match a single digit.
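The three types can be told apart with the Numeric_Type property. A minimal sketch, where `numeric_type` is a hypothetical helper and the property names are the ones listed in perluniprops on reasonably recent perls:

```perl
use strict;
use warnings;

# Hypothetical helper: name the Unicode Numeric_Type of a single character.
sub numeric_type {
    my ($char) = @_;
    return 'Decimal' if $char =~ /\A\p{Numeric_Type=Decimal}\z/;
    return 'Digit'   if $char =~ /\A\p{Numeric_Type=Digit}\z/;
    return 'Numeric' if $char =~ /\A\p{Numeric_Type=Numeric}\z/;
    return 'None';
}

my $decimal = "\x{0661}";    # ARABIC-INDIC DIGIT ONE
my $digit   = "\x{2081}";    # SUBSCRIPT ONE
my $numeric = "\x{2165}";    # ROMAN NUMERAL SIX

for my $char ($decimal, $digit, $numeric) {
    printf "U+%04X: %s, matches \\d: %s\n",
        ord $char, numeric_type($char), $char =~ /\d/ ? 'yes' : 'no';
}
```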
> Non-unihan Unicode has three types of numbers. Decimal digit, other digit, and other numeric. \d only matches decimal digits, which is the "right" thing in my mind. "other digits" are like superscript 1. I think it is a reasonable argument to make that \d shouldn't match anything that you can't numify automatically; I don't think it is a good idea to have it match superscripts nor roman numerals, nor fractions.
Oh. I left out the possibility of having nonsensical mixed expressions return NaN after they warn, in case someone trying to implement Text::Numeric::Any reads this thread some day.
> We could extend numification to handle any or all Unicode code points that have a numeric value. But I don't think \d should match anything more than decimal digits. There is a CJK ideograph that means 10**12. People are expecting \d to match a single digit.
so if \d only means [0-9] plus various other kinds of [0-9] in different writing systems, and doesn't include, for instance, [①-⒛], without changing the semantics any, utoi and utof pretty much write themselves. What's the range of characters permissible for the point in floating point and should \. match them too, if there are more?
David Nicol wrote:
> On Thu, Oct 29, 2009 at 12:01 PM, karl williamson
>> Non-unihan Unicode has three types of numbers. Decimal digit, other digit, and other numeric. \d only matches decimal digits, which is the "right" thing in my mind. "other digits" are like superscript 1. I think it is a reasonable argument to make that \d shouldn't match anything that you can't numify automatically; I don't think it is a good idea to have it match superscripts nor roman numerals, nor fractions.
> Oh. I left out the possibility of having nonsensical mixed expressions return NaN after they warn, in case someone trying to implement Text::Numeric::Any reads this thread some day.
>> We could extend numification to handle any or all Unicode code points that have a numeric value. But I don't think \d should match anything more than decimal digits. There is a CJK ideograph that means 10**12. People are expecting \d to match a single digit.
> so if \d only means [0-9] plus various other kinds of [0-9] in different writing systems, and doesn't include, for instance, [①-⒛], without changing the semantics any, utoi and utof pretty much write themselves. What's the range of characters permissible for the point in floating point and should \. match them too, if there are more?
The decimal point is locale dependent and not specified in the Unicode standard, but they have a CLDR (Common Locale Data Repository) project that I believe contains that info.
Karl wrote:
> Non-unihan Unicode has three types of numbers. Decimal digit, other digit, and other numeric. \d only matches decimal digits, which is the "right" thing in my mind. "other digits" are like superscript 1. I think it is a reasonable argument to make that \d shouldn't match anything that you can't numify automatically; I don't think it is a good idea to have it match superscripts nor roman numerals, nor fractions.
> We could extend numification to handle any or all Unicode code points that have a numeric value. But I don't think \d should match anything more than decimal digits. There is a CJK ideograph that means 10**12. People are expecting \d to match a single digit.
There may be issues even with just that. What's a "digit", really? Some code points that by name call themselves DIGITs are \d, but some calling themselves DIGITs aren't--they're \D, like all the counting-rod digits:
\D U+1d360 COUNTING ROD UNIT DIGIT ONE
\D U+1d369 COUNTING ROD TENS DIGIT ONE
True, those seem no great loss to ignore, but I'm not so sure all others are. Some scripts' DIGITs are actually mixed \d and \D, like:
\d U+00f21 TIBETAN DIGIT ONE
\D U+00f2a TIBETAN DIGIT HALF ONE
\d U+00c67 TELUGU DIGIT ONE
\D U+00c79 TELUGU FRACTION DIGIT ONE FOR ODD POWERS OF FOUR
I imagine, but do not know, that numbers in those scripts might be composed of a mixture of \d and \D DIGITs.
Ethiopic DIGITs are *all* \D, and read left to right:
\D L U+01369 ETHIOPIC DIGIT ONE
Whereas all Kharoshthi DIGITs are also \D, but read right to left:
\D R U+10a40 KHAROSHTHI DIGIT ONE
You'd think it'd be less confusing just within say, European Numbers, but it isn't that obviously so. Our "1", the Arabic numeral digit one, is of \p{BidiClass:EN}, a European Number not an Arabic one:
\d EN U+00031 DIGIT ONE
While Arabic-Indic's digit one is instead \p{BidiClass:AN}, an Arabic numeral that indeed counts as an Arabic Number:
\d AN U+00661 ARABIC-INDIC DIGIT ONE
Yet the *extended* Arabic-Indic's digit one has swapped back to being a European Number again:
\d EN U+006f1 EXTENDED ARABIC-INDIC DIGIT ONE
If automagic atod()ish nummification [hmm: or numefication?*] for digit-strings is the goal, I don't know how to handle digit strings composed of digits from various scripts. Besides the \d \D issues above, even when restricted to \d digits, directional concerns remain.
Most go left to right:
\d L U+009e7 BENGALI DIGIT ONE
\d L U+00e51 THAI DIGIT ONE
\d L U+00f21 TIBETAN DIGIT ONE
But Nko digits, which are real \d digits, are written right to left:
\d R U+007c1 NKO DIGIT ONE
I think it's probably prudent to avoid most or all \D numbers, like subscripts and superscripts, even when those count as European Numbers:
\D EN U+000b9 SUPERSCRIPT ONE
\D EN U+02081 SUBSCRIPT ONE
\D EN U+02488 DIGIT ONE FULL STOP
\D ON U+02460 CIRCLED DIGIT ONE
\D ON U+0278a DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ONE
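Rows like the ones in these listings can be produced with a few lines of Perl. This is only a sketch of one way to do it: `describe` is a hypothetical helper, and it relies on charinfo from Unicode::UCD for the bidi class and the name:

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Hypothetical helper that formats one row in the style used above:
# \d or \D, the bidi class, the code point, and the character name.
sub describe {
    my ($cp) = @_;
    my $info = charinfo($cp) or return;
    my $class = chr($cp) =~ /\d/ ? '\d' : '\D';
    return sprintf '%s %s U+%05x %s', $class, $info->{bidi}, $cp, $info->{name};
}

print describe($_), "\n" for 0x0031, 0x0661, 0x06F1, 0x07C1, 0x1369, 0x2081;
```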
If somebody is asked to enter the year like for copyrights, then 2009 of course works perfectly fine, and so do some other scripts. It seems too specific to the problem domain, not to mention just plain dangerous, to try to cope with Roman numerals, whether entered as MMIX or in the Unicode versions.
Other writing systems than Roman/Latin also use their letters as numeric digits. If the current Hebrew year is 2009 + 3760 = 5769, they'd then write that--from right to left--as 769, dropping the 5000 by convention, this way:
\D R U+005ea HEBREW LETTER TAV (number 400)
\D R U+005e9 HEBREW LETTER SHIN (number 300)
\D R U+005e1 HEBREW LETTER SAMEKH (number 60)
\D R U+005d8 HEBREW LETTER TET (number 9)
Actually, order doesn't really matter: it's not positional, just accumulative. Since we'd boot the Romans, I guess we'd boot the Hebrews, too; don't see any way around it, really.
Just a few conundra I've been tossing around.
--tom --
* numinal: divine.
numinous: of or pertaining to a numen; divine, spiritual, revealing or suggesting the presence of a god; inspiring awe and reverence. Hence numinosity, numinousness, numinously.
> The historical behavior was to match only ASCII, but Unicode support was added later. Most people remained ignorant about this change and its implications. But anyone who is aware of the added Unicode support is bitten by the hard to predict distinction between two sets of semantics.
> Now, it's fairly simple to force Perl to use only the Unicode semantics. Just utf8::upgrade the string and these operators and the regex engine will behave predictably. And that's what I told people to do. Upgrade your strings until Perl is fixed. I've told IRC, I've told YAPC and workshop audiences, Perl Monks, and readers of the now-official perlunifaq. I've heard others echo the advice and I've seen lots of utf8::upgrade's in the wild.
> Throughout the years I have always assumed that Perl would be fixed by abandoning the ASCII-only behaviour, embracing Unicode as the default as this had been the direction ever since 5.6. This assumption is reflected in much of my writings and several talks on the subject.
Lots of good points about the breakage of utf-8 vs non-utf-8. This should be fixed, but it's not really related to concerns I have about \d. I think you and I agree that users should not need to know whether a string is utf-8 or non-utf-8. All Perl operations should, if at all possible, operate the same, no matter what internal form it happens to use to encode the string. Every exception to this rule is a leaky abstraction. It's bad, and it has prevented Perl from successfully reaching the state of "embracing Unicode."
My concerns for \d are well covered by other people, even in posts today. I have LOTS of code that uses \d written over the last 20 years, and it is very concerning to me that this code, which uses \d as a guard against invalid input, may be accepting Unicode characters which cause my programs to break, cause my programs to be exploitable over the network, or cause data corruption.
I have a lot of code that does something like:
$number = /\A\d+\z/ ? 0+$_ : die "...";
It scares me that this code may now be broken, or may become broken.
Was I wrong to use \d+? What should I have used? I've been taught to avoid [0-9] in all languages since before I learned Perl, due to silly character sets like EBCDIC. Now it seems like my only choice is:
$number = /\A[0123456789]+\z/ ? 0 + $_ : die "...";
To me, that's just insanity.
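For comparison, a sketch of the same guard written with a range. `parse_count` is a hypothetical name. As far as I know, the Perl documentation (perlrecharclass) promises that the range 0-9 matches exactly the ten ASCII digits on every platform, EBCDIC included, so spelling out all ten characters should not be needed:

```perl
use strict;
use warnings;

# Hypothetical guard: accept only ASCII digits, then numify.
sub parse_count {
    my ($input) = @_;
    die "not an ASCII decimal number\n" unless $input =~ /\A[0-9]+\z/;
    return 0 + $input;
}

my $ok = parse_count("2009");

# ARABIC-INDIC DIGITs spelling 2009: accepted by \d+, rejected by [0-9]+.
my $arabic_indic = "\x{0662}\x{0660}\x{0660}\x{0669}";
my $loose_guard  = $arabic_indic =~ /\A\d+\z/ ? 1 : 0;
my $rejected     = eval { parse_count($arabic_indic); 1 } ? 0 : 1;

print "ok: $ok, loose guard accepts: $loose_guard, strict guard rejects: $rejected\n";
```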
I don't think \d should match anything that would allow /\A\d+\z/ to result in a value where ("$_" != 0+$0). Go ahead and make 0+$0 better, or go ahead and make \d match only ASCII '0' through '9' - but anything that causes this "identity" break is a BAD decision. Heck, even go ahead and make a BAD decision - but the result will be that I recommend against using Perl. I trust my programming language to do what I tell it to. If it starts doing stupid things with each future release - I will stop trusting it, and I will stop using it.
Mark Mielke wrote:
> On 10/28/2009 06:50 AM, Juerd Waalboer wrote:
>> The historical behavior was to match only ASCII, but Unicode support was added later. Most people remained ignorant about this change and its implications. But anyone who is aware of the added Unicode support is bitten by the hard to predict distinction between two sets of semantics.
>> Now, it's fairly simple to force Perl to use only the Unicode semantics. Just utf8::upgrade the string and these operators and the regex engine will behave predictably. And that's what I told people to do. Upgrade your strings until Perl is fixed. I've told IRC, I've told YAPC and workshop audiences, Perl Monks, and readers of the now-official perlunifaq. I've heard others echo the advice and I've seen lots of utf8::upgrade's in the wild.
>> Throughout the years I have always assumed that Perl would be fixed by abandoning the ASCII-only behaviour, embracing Unicode as the default as this had been the direction ever since 5.6. This assumption is reflected in much of my writings and several talks on the subject.
> Lots of good points about the breakage of utf-8 vs non-utf-8. This should be fixed, but it's not really related to concerns I have about \d. I think you and I agree that users should not need to know whether a string is utf-8 or non-utf-8. All Perl operations should, if at all possible, operate the same, no matter what internal form it happens to use to encode the string. Every exception to this rule is a leaky abstraction. It's bad, and it has prevented Perl from successfully reaching the state of "embracing Unicode."
> My concerns for \d are well covered by other people, even in posts today. I have LOTS of code that uses \d written over the last 20 years, and it is very concerning to me that this code, which uses \d as a guard against invalid input, may be accepting Unicode characters which cause my programs to break, cause my programs to be exploitable over the network, or cause data corruption.
> I have a lot of code that does something like:
> $number = /\A\d+\z/ ? 0+$_ : die "...";
> It scares me that this code may now be broken, or may become broken.
> Was I wrong to use \d+? What should I have used? I've been taught to avoid [0-9] in all languages since before I learned Perl, due to silly character sets like EBCDIC. Now it seems like my only choice is:
> I don't think \d should match anything that would allow /\A\d+\z/ to result in a value where ("$_" != 0+$0). Go ahead and make 0+$0 better, or go ahead and make \d match only ASCII '0' through '9' - but anything that causes this "identity" break is a BAD decision. Heck, even go ahead and make a BAD decision - but the result will be that I recommend against using Perl. I trust my programming language to do what I tell it to. If it starts doing stupid things with each future release - I will stop trusting it, and I will stop using it.
Turning away from the focus on domain names for a moment, there is another area where visual spoofs can be used. Many scripts have sets of decimal digits that are different in shape from the typical European digits {0 1 2 3 4 5 6 7 8 9}. For example, Bengali has {০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯}, while Oriya has {୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯}. While the sets taken as a whole are different in shape, individual digits may have the same shapes as digits from other scripts, even digits of different values. For example, the string ৪୨ is visually confusable with 89 (at small sizes), but actually has the numeric value 42. Where software interprets the numeric value of a string of digits without detecting that the digits are from different scripts, it is possible to generate such spoofs.
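Detecting the mixed-script case is not much code. A sketch, assuming perl 5.10 or later; both function names are hypothetical, and the script lookup uses charscript from Unicode::UCD:

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo charscript);

# Hypothetical check: true if the string is all decimal digits and every
# digit belongs to the same script.
sub digits_from_one_script {
    my ($str) = @_;
    return 0 unless $str =~ /\A\d+\z/;
    my %scripts;
    $scripts{ charscript(ord $_) // 'unknown' }++ for split //, $str;
    return keys(%scripts) == 1 ? 1 : 0;
}

# Value read digit by digit, ignoring which script each digit comes from.
sub digit_value {
    my ($str) = @_;
    return join '', map { charinfo(ord $_)->{decimal} } split //, $str;
}

my $spoof = "\x{09EA}\x{0B68}";    # BENGALI DIGIT FOUR, ORIYA DIGIT TWO

printf "value %s, single script: %s\n",
    digit_value($spoof), digits_from_one_script($spoof) ? 'yes' : 'no';
```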
Abigail wrote:
> If we want '\w+' to be able to match Klingon "words", why shouldn't \d+ match Klingon numbers? Yes, \d+ matches digits from different scripts, but \w+ matches word characters from different scripts as well.
> Now, don't consider this an argument in favour of having \w match non-ASCII characters - but, IMO, if \w can match non-ASCII characters, so should \d.
> Abigail
The Unicode character database file UnicodeData.txt contains, in fields 6, 7 and 8, the value of numeric characters. Could we not use that to numify characters such as Ⅲ and Ⅸ, so that Ⅸ - Ⅳ == 5?
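As a rough illustration of what those three fields hold, the core module `Unicode::UCD` exposes them through `charinfo()` under the keys `decimal`, `digit` and `numeric`. This sketch only reads the values by hand; it does not claim that Perl numifies such characters automatically.

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# Show UnicodeData.txt fields 6, 7 and 8 for a few code points.
# An empty field is printed as '-'.
for my $cp (0x0031, 0x00B9, 0x2168, 0x2163) {
    my $info = charinfo($cp);
    printf "U+%04X %-20s decimal=%-2s digit=%-2s numeric=%s\n",
        $cp, $info->{name},
        map { length($info->{$_}) ? $info->{$_} : '-' } qw(decimal digit numeric);
}

my $nine = charinfo(0x2168)->{numeric};   # ROMAN NUMERAL NINE
my $four = charinfo(0x2163)->{numeric};   # ROMAN NUMERAL FOUR
print "IX - IV = ", $nine - $four, "\n";
```

Note that the `numeric` field is a string and can be a fraction such as `1/2`, so a real numifier would need to handle that form as well.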
karl williamson wrote:
> The decimal point is locale dependent and not specified in the Unicode standard, but they have a CLDR (Common Locale Data Repository) project that I believe contains that info.
>> Non-unihan Unicode has three types of numbers. Decimal digit, other digit, and other numeric. \d only matches decimal digits, which is the "right" thing in my mind. "other digits" are like superscript 1. I think it is a reasonable argument to make that \d shouldn't match anything that you can't numify automatically; I don't think it is a good idea to have it match superscripts nor roman numerals, nor fractions.
>> We could extend numification to handle any or all Unicode code points that have a numeric value. But I don't think \d should match anything more than decimal digits. There is a CJK ideograph that means 10**12. People are expecting \d to match a single digit.
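The three kinds of number mentioned above correspond to the Unicode Numeric_Type property (Decimal, Digit, Numeric). A small sketch, assuming a Perl whose regex engine knows `\p{Numeric_Type=...}`; the helper name `numeric_type` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report the Unicode Numeric_Type of one character.
sub numeric_type {
    my ($ch) = @_;
    return 'Decimal' if $ch =~ /\A\p{Numeric_Type=Decimal}\z/;
    return 'Digit'   if $ch =~ /\A\p{Numeric_Type=Digit}\z/;
    return 'Numeric' if $ch =~ /\A\p{Numeric_Type=Numeric}\z/;
    return 'None';
}

my @samples = (
    [ 'BENGALI DIGIT FOUR',       "\x{09EA}" ],
    [ 'SUPERSCRIPT ONE',          "\x{00B9}" ],
    [ 'ROMAN NUMERAL NINE',       "\x{2168}" ],
    [ 'VULGAR FRACTION ONE HALF', "\x{00BD}" ],
);

for my $sample (@samples) {
    my ($name, $ch) = @$sample;
    printf "%-26s type=%-8s \\d=%s\n",
        $name, numeric_type($ch), ($ch =~ /\A\d\z/ ? 'yes' : 'no');
}
```

Only the first sample is matched by \d; the superscript, the roman numeral and the fraction all carry a numeric value but are not decimal digits.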
> There may be issues even with just that. What's a "digit", really? Some code points that by name call themselves DIGITs are \d, but some calling themselves DIGITs aren't--they're \D, like all the counting-rod digits:
> \D U+1d360 COUNTING ROD UNIT DIGIT ONE
> \D U+1d369 COUNTING ROD TENS DIGIT ONE
> True, those seem no great loss to ignore, but I'm not so sure all others are. Some scripts' DIGITs are actually mixed \d and \D, like:
If we decide to continue to allow \d to match non-ASCII, and I'm not advocating that, it should only match decimal digits, regardless of the names of the characters.
> \d U+00f21 TIBETAN DIGIT ONE
> \D U+00f2a TIBETAN DIGIT HALF ONE
Here is an example of a poor choice of name, or at least one that is misunderstood by people. The term DIGIT in Unicode means only that it is a single character that has a numeric meaning, much like we refer to the ASCII F as a hexadecimal digit. So a DIGIT in Unicode doesn't have to mean 0 through 9.
In this particular case, the HALF ONE means that this is 1 - .5 = .5, which is the numeric value of the character. So it is a digit that means a non-integral number. I think it should have been named ONE MINUS HALF. There is a HALF ZERO whose value is -.5. I got curious a while back as to why Tibetan, of all languages in the world, would have a single character encoding the concept of -.5, so I looked it up on the internet. IIRC, the claim was that all these half digits are based on a single Tibetan postage stamp showing one of the values, that the others were inferred artificially from Tibetan grammatical rules, and that there is no concrete evidence that they ever really existed except for the one on that stamp. There's a picture of it on the internet.
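For anyone who wants to check these values, here is a short sketch using `Unicode::UCD::num`, which assumes Perl 5.14 or later. It prints whether each character is \d or \D and the numeric value Unicode assigns to it.

```perl
use strict;
use warnings;
use Unicode::UCD 'num';   # num() is available from Perl 5.14

my %tibetan = (
    'TIBETAN DIGIT ONE'       => "\x{0F21}",
    'TIBETAN DIGIT HALF ONE'  => "\x{0F2A}",
    'TIBETAN DIGIT HALF ZERO' => "\x{0F33}",
);

for my $name (sort keys %tibetan) {
    my $ch    = $tibetan{$name};
    my $class = $ch =~ /\A\d\z/ ? '\d' : '\D';
    my $value = num($ch);   # numeric value of a single character
    printf "%-24s %s value=%s\n", $name, $class, $value;
}
```

The HALF characters carry fractional values but are not decimal digits, which is why they fall under \D.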
> \d U+00c67 TELUGU DIGIT ONE
> \D U+00c79 TELUGU FRACTION DIGIT ONE FOR ODD POWERS OF FOUR
> I imagine, but do not know, that numbers in those scripts might be composed of a mixture of \d and \D DIGITs.
U+0C79 also has numeric value one, but it appears to be more like a remainder after taking a number mod 4, so I doubt that it stands on its own.
It seems we may be presuming a base 10 system inappropriately at times.
> Ethiopic DIGITs are *all* \D, and read left to right:
> \D L U+01369 ETHIOPIC DIGIT ONE
> Whereas all Kharoshthi DIGITs are also \D, but read right to left:
> \D R U+10a40 KHAROSHTHI DIGIT ONE
> You'd think it'd be less confusing just within say, European Numbers, but it isn't that obviously so. Our "1", the Arabic numeral digit one, is of \p{BidiClass:EN}, a European Number not an Arabic one:
> \d EN U+00031 DIGIT ONE
> While Arabic-Indic's digit one is instead \p{BidiClass:AN}, an Arabic numeral that indeed counts as an Arabic Number:
> \d AN U+00661 ARABIC-INDIC DIGIT ONE
> Yet the *extended* Arabic-Indic's digit one has swapped back to being a European Number again:
> \d EN U+006f1 EXTENDED ARABIC-INDIC DIGIT ONE
These classifications as European or Arabic numbers exist solely for implementing the Unicode Bidirectional Algorithm, and don't mean anything beyond that. Again, using the decimal digit type would be the way to go, if we continue to go there.
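The bidi classes quoted in this exchange can be read straight out of the Unicode database with `charinfo()` from the core module `Unicode::UCD`; the `bidi` key holds the class. A quick sketch covering the code points mentioned in this thread:

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# Print the bidi class and \d status of each "digit one" discussed here.
for my $cp (0x0031, 0x0661, 0x06F1, 0x07C1, 0x09E7) {
    my $info  = charinfo($cp);
    my $class = chr($cp) =~ /\A\d\z/ ? '\d' : '\D';
    printf "%s %-2s U+%05X %s\n", $class, $info->{bidi}, $cp, $info->{name};
}
```

All five are decimal digits, so the bidi class says nothing about whether \d matches; it only affects display order.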
> If automagic atod()ish nummification [hmm: or numefication?*] for digit-strings is the goal, I don't know how to handle digit strings composed of digits from various scripts. Besides the \d \D issues above, even when restricted to \d digits, directional concerns remain.
> Most go left to right:
> \d L U+009e7 BENGALI DIGIT ONE
> \d L U+00e51 THAI DIGIT ONE
> \d L U+00f21 TIBETAN DIGIT ONE
> But Nko digits, which are real \d digits, are written right to left:
> \d R U+007c1 NKO DIGIT ONE
These are some of the reasons I'm leery of allowing \d to match outside ASCII by default. I don't get the symmetry argument of Abigail's. A number of people have responded recently about how they're surprised at how it really works now. Much code has been written that assumes that \d is [0-9], and that digits are part of a base 10 number written left-to-right. I'm not sure right now if there is a way for a program that doesn't use Encode or the command line options or I/O layers to get Unicode data unexpectedly. But even programs that do use them may be getting more than they bargained for, because Unicode is a large beast.
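For code that uses \d purely as an input guard, a sketch of two ways to insist on ASCII digits. The helper name `parse_ascii_uint` is hypothetical, and the `/a` modifier shown at the end only exists from Perl 5.14 on, so it was not an option in the releases discussed in this thread.

```perl
use strict;
use warnings;

# Hypothetical guard: accept only a non-empty run of ASCII digits.
sub parse_ascii_uint {
    my ($input) = @_;
    return unless defined $input && $input =~ /\A[0-9]+\z/;
    return 0 + $input;
}

my $ascii   = "42";
my $bengali = "\x{09EA}\x{09E8}";   # BENGALI DIGIT FOUR, BENGALI DIGIT TWO

for my $candidate ($ascii, $bengali) {
    my $value = parse_ascii_uint($candidate);
    printf "default \\d+: %-3s  /a \\d+: %-3s  guard: %s\n",
        ($candidate =~ /\A\d+\z/  ? 'yes' : 'no'),   # Unicode digits allowed
        ($candidate =~ /\A\d+\z/a ? 'yes' : 'no'),   # ASCII only (5.14+)
        (defined $value ? $value : 'rejected');
}
```

The explicit `[0-9]` class behaves the same on every Perl version, while `/a` keeps the \d spelling at the cost of requiring a newer perl.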