in blead.
If I'm wrong about the agreement, I would like to start another
discussion, and my initial position is that they should only match in
the ASCII range.
Yes please. The ASCII versions are very commonly required, deserving of
a shorthand, and currently lack any abbreviated form at all. Matching
extendable sets of Unicode characters is a much less common requirement,
and can already be expressed in explicitly-Unicode-based ways.
-zefram
I'm inclined to say it just slipped by me. I'll poke it with a stick
when I get a chance.
> If I'm wrong about the agreement, I would like to start another discussion,
> and my initial position is that they should only match in the ASCII range.
Agreed.
2009/9/30 Zefram <zef...@fysh.org>:
> Yes please. The ASCII versions are very commonly required, deserving of
> a shorthand, and currently lack any abbreviated form at all. Matching
> extendable sets of Unicode characters is a much less common requirement,
> and can already be expressed in explicitly-Unicode-based ways.
Yes, I concur.
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
Just to be precise about it, I neglected to mention that my statement
was meant only to apply in the absence of a "use locale", and whatever
the base C library routines do on an EBCDIC system. I wasn't advocating
changing the behavior under those circumstances.
I toyed with a small piece of code and it seems it's not working as
specified in the delta anyway:
http://gist.github.com/200900
So apparently the delta is not correct, or is the delta trying to specify
what *will* be changed but is not done yet?
Anyway, I have tons of scripts that rely on \d matching Japanese
numbers, \s matching full-width space, etc. Being able to have a
pragma to enable/disable the new behavior would be very nice. (I
understand I can start rewriting those \d to something like \p{IsDigit}
to be forward compatible, though.)
On Thu, Oct 1, 2009 at 6:09 PM, karl williamson <pub...@khwilliamson.com> wrote:
> demerphq wrote:
>>
>> 2009/9/30 karl williamson <pub...@khwilliamson.com>:
>>>
>>> I had thought in our discussion last year that we had determined that
>>> these
>>> should match only in the ASCII range. And so, I thought that when Yves
>>> flipped the switch on the \p{Posix} matches, that these would change as
>>> well, but that isn't the case:
>>> perl -E "say chr(0x2028) =~ /\s/"
>>> 1
>>>
>>> in blead.
>>
>> Im inclined to say it just slipped me by. Ill poke it with a stick
>> when i get a chance.
>>
>>> If I'm wrong about the agreement, I would like to start another
>>> discussion,
>>> and my initial position is that they should only match in the ASCII
>>> range.
>>
>> Agreed.
>
> Just to be precise about it, I neglected to mention that my statement was
> meant only to apply in the absence of a "use locale", and whatever the base
> C library routines do on an EBCDIC system. I wasn't advocating changing the
> behavior under those circumstances.
>
--
Tatsuhiko Miyagawa
Yes, the delta is not correct, but gives the current plan, so that
should be what happens.
> Anyway, I have tons of scripts that rely on \d matching Japanese
> numbers and \s matches with full-width space etc. Being able to have a
> pragma to enable/disable the new behavior would be very nice. (I
> understand I can start rewriting those \d to like \p{IsDigit} to be
> forward compatible, though)
>
Note that the 'Is' is optional. The chart in the delta gives the
mappings for \s and \w as well. Note also that if you can accept a
vertical tab in \s, \p{Space} is shorter.
There are plans for a pragma for other unicode incompatibilities, and a
git branch that includes the beginnings of one: "use legacy". I had
thought that these changes could be controlled by a pragma, and I hope
that it is this one.
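For anyone following along, here is a minimal sketch of the explicit spellings mentioned above. The sample characters are my own choice for illustration, not something taken from the delta, and the snippet only shows what the explicit forms match; it says nothing about what \d or \s themselves will end up doing.

```perl
use strict;
use warnings;

# Sample characters chosen for illustration only.
my $thai_seven = "\x{0E57}";    # THAI DIGIT SEVEN
my $vtab       = "\x0B";        # vertical tab (LINE TABULATION)

# \p{IsDigit} and \p{Digit} name the same property; the 'Is' is optional.
my $with_is    = $thai_seven =~ /\A\p{IsDigit}\z/ ? 1 : 0;
my $without_is = $thai_seven =~ /\A\p{Digit}\z/   ? 1 : 0;

# An explicit ASCII class never matches a non-ASCII digit.
my $ascii_only = $thai_seven =~ /\A[0-9]\z/ ? 1 : 0;

# \p{Space} accepts the vertical tab.
my $space_vt = $vtab =~ /\A\p{Space}\z/ ? 1 : 0;

print "IsDigit=$with_is Digit=$without_is [0-9]=$ascii_only Space(VT)=$space_vt\n";
```

Spelling the class out this way documents the intent in the pattern itself, whichever way the shorthand question is settled.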
> There are plans for a pragma for other unicode incompatibilities, and a
> git branch that includes the beginnings of one: "use legacy". I had
> thought that these changes could be controlled by a pragma, and I hope
> that it is this one.
If the changes will be controlled by a pragma, what's the point of
forcing existing code to 'use legacy' rather than making these changes
part of 'use 5.12'?
We've always had a strong culture of not gratuitously breaking backwards
compatibility. This seems like a strange thing to choose to throw that
away on.
-Jesse
> If the changes will be controlled by a pragma, what's the point of
> forcing existing code to 'use legacy' rather than making these changes
> part of 'use 5.12'?
+1
--
Tatsuhiko Miyagawa
There has been plenty of discussion about this over the years. The
simple explanation is that the current scheme is broken in various ways.
In the things I'm working on fixing, the breakage is essentially that
the internal storage detail (utf8 or not) of strings changes their
external semantics. This has gone on for a long time, and it leads to
all sorts of unexpected results, for example when Perl decides for any
number of reasons to change the storage type of a string. However,
whenever any product is in the field long enough, people come to rely on
its bugs. So, we are planning to add a pragma for those relatively few
who rely on the broken behavior. For most, programs will actually work
more correctly.
For the \d, etc. things, there are a number of arguments for changing
their behavior. The one I can think of right now is that it currently
can be a security threat, in that most Perl programs out there are not
expecting Unicode at all, and so having, e.g., \d match not just 10
things but 411, or \w match not just 63 things but 101,685 can lead to
lots of unintended consequences. It seems better that the program has
to explicitly indicate that it is prepared to handle these expanded cases.
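A small sketch of the kind of surprise I mean, assuming Unicode semantics are in effect for the match (which they are for a string holding code points above 0xFF). The input value is hypothetical, made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical untrusted input: ARABIC-INDIC DIGITS ONE and TWO.
my $input = "\x{0661}\x{0662}";

# \d accepts any Unicode decimal digit here, so this "numeric" check passes.
my $loose = $input =~ /\A\d+\z/ ? 1 : 0;

# Spelling out the ASCII range rejects the same input.
my $strict = $input =~ /\A[0-9]+\z/ ? 1 : 0;

print "\\d+ accepts: $loose, [0-9]+ accepts: $strict\n";
```

A program that validates with the first pattern and then hands the value to something expecting ASCII digits gets data it was never written for.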
> There has been plenty of discussion about this over the years. The simple
> explanation is that the current scheme is broken in various ways. In the
> things I'm working on fixing, the breakage is essentially that the internal
> storage detail (utf8 or not) of strings changes their external semantics.
Yes, that has been really annoying and I appreciate your efforts to fix that.
> For the \d, etc things, there are a number of arguments for changing their
> behavior.
Yes :)
> The one I can think of right now, is that it currently can be a
> security threat, in that most perl programs out there are not expecting
> Unicode at all, and so having, eg., \d match not just 10 things but 411, or
> \w match not just 63 things but 101,685 can lead to lots of unintended
> consequences. It seems better that the program has to explicitly indicate
> that it is prepared to handle these expanded cases.
I understand what your policy is about this, but seeing expressions
like "most perl programs" and "(The ASCII versions are) very commonly
required" (earlier in this thread) kind of upsets me, because
that's not what I expect in most modern perl programs I write both
personally and at work.
--
Tatsuhiko Miyagawa
Let me outline the problem here a little.
With the schizo behaviour of Perl string semantics, especially with
regard to the regex engine, there is no way to fix some of the problems
without introducing breakage *somewhere*.
An example is the behaviour of \d, or of [[:alpha:]],
which match different things "in unicode mode" than they do in
"non-unicode mode".
This completely breaks the charclass logic, resulting in POSIX
charclasses and their negations under unicode matching the same
characters(!!!!!), amongst other intriguing bugs.
There is no way to fix these problems without changing the defined
behaviour of these constructs.
There are many, many problems like this. For instance the *hell* that
\xDF causes because it matches 'ss' case-insensitively in unicode, and
doesn't match anything in non-unicode.
So the plan is to fix this stuff so it is consistent, and deal with
any incidental breakage as we can.
With regard to backwards compatibility, I actually have NO plans to
introduce EITHER pragma OR feature flags to enable the old behaviour.
The old behaviour is buggy, broken and internally inconsistent. We do
NOT provide flags to re-enable old fixed bugs for anything else, so why
should we do it with the regex engine?
Now, before the steam starts flowing from your ears, I'll let you in on
a little secret:
The user community can do this itself.
You see there has been support for overriding the content of a Regex
pattern prior to regex compilation for a very long time. Using this
infrastructure one can define drop in modules that override \d and \s
and whatever it is that people want to override, with *anything* they
want. So IMO this is a non-problem. People that really want \d to mean
\p{IsDigit} can just define a regex pattern filter to munge \d into
\p{IsDigit} and get the *sane* and predictable results they wanted.
Now if it turns out that what I describe above is impossible then I
will reconsider the subject of including legacy support for this in
the regex engine. However it has to be really, really amazingly
impossible for me to go there.
I'd just like to repeat a point here. WE CAN'T FIX THIS STUFF WITHOUT
BREAKING SOMETHING.
So we can either leave it broken forever, or we can take the hit
sometime to deal with the underlying conceptual breakage, and that is
what I believe the plan is/was for 5.12.
I have to admit I worried about this a bit, but came to the conclusion
that likely there are
a) more people who get bitten by \d including things it shouldn't than
b) people like you who really want \d to mean \p{IsDigit}.
I do regret that it might impact you personally, and hope that we can
get some drop in regex compilation filters in place so that you can do
something like:
use re::UnicodeShortForms;
and have \d mean \p{IsDigit}
However I feel very strongly that somebody has to take it in the neck
to get this fixed, and so while I sympathise with anyone negatively
impacted, I can't really do more. If it's not you it will be someone
else, regarding something else.
The status quo can't be fixed without something giving, and on the balance
of things the area that impacts you the most seems like the area likely
to impact the least number of people. Maybe we will hear more noise to
the contrary, in which case my view might change, but right now I don't
see a feasible path forward without changing this area of things.
My humble apologies for potentially ruining your day. I'd be happy to
work with you to assist in coming up with a reasonable workaround.
Please DO keep the feedback coming, even negative; we may have missed
more serious breakage that we CAN resolve in a backwards compatible
way.
cheers,
I've never seriously used this feature before but I did once pen a
lexical \w replacement to mean [A-Za-z'.-] because I was matching lots
of names. Below is Yves' suggestion for a community-solved problem.
It ought to work back to 5.6 too. Of course, this could work the *other*
way too. This turns \d => \p{IsDigit} but could also turn it into [0-9].
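    package re::UnicodeShortForms;
    use 5.006;
    use overload;
    # Escapes to rewrite; add more (w, s, ...) as needed.
    our %REPLACEMENTS = (
        d => '\p{IsDigit}',
        # w => ...
    );
    # Install the rewriter for every regex constant in the caller's scope.
    sub import { overload::constant( qr => \&rewrite_regexp ) }
    sub rewrite_regexp {
        my ( undef, $text ) = @_;
        # Visit every backslash escape; replace known ones, keep the rest.
        $text =~ s{
            \\
            ( d | . )
        }{
            $REPLACEMENTS{$1} || "\\$1"
        }xge;
        return $text;
    }
    1;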
'Josh'
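To show the "other way" mentioned above, here is a one-file sketch of the same overload::constant technique mapping \d to [0-9]. The package name My::AsciiDigit is hypothetical, and the import is called from a BEGIN block only so that no separate module file is needed; treat it as an illustration under those assumptions, not a finished module.

```perl
use strict;
use warnings;

BEGIN {
    # Hypothetical package name, defined inline so the sketch is one file.
    package My::AsciiDigit;
    use overload ();

    sub import {
        # Rewrite every constant piece of a regex before it is compiled.
        overload::constant( qr => sub {
            my ( undef, $text ) = @_;
            # Visit each backslash escape; turn \d into [0-9], keep the rest.
            $text =~ s{ \\ (.) }{ $1 eq 'd' ? '[0-9]' : "\\$1" }xgse;
            return $text;
        } );
    }
}

# Equivalent to "use My::AsciiDigit;" without needing a separate file.
BEGIN { My::AsciiDigit->import }

my $mixed   = "1\x{0E57}";    # ASCII 1 followed by THAI DIGIT SEVEN
my $matched = $mixed =~ /\A\d+\z/ ? 1 : 0;    # compiled as /\A[0-9]+\z/
print "matched: $matched\n";
```

Note that this naive rewrite would mangle a \d written inside a bracketed class such as [\d.], so a real module would need to parse the pattern more carefully.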
It is reasonable for \d to mean either "any digit" or [0-9],
as long as \d stands on its own. The problem starts as soon as people
write something like /1\d/.
Surely they aren't expecting to match a 1 followed by a Thai 7.
Even a /\d+/ will often match more than people want, as it can happily
match a string of digits of different scripts. The fact that
    if (/(\d+)/) {
        $num += $1;
    }
can lead to warnings and unexpected behaviour makes, IMO, \d rather useless.
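A minimal sketch of that failure mode; the Thai digits are just my sample input. Perl's string-to-number conversion only understands ASCII digits, so the captured text contributes nothing to the sum.

```perl
use strict;
use warnings;

my $thai = "\x{0E57}\x{0E58}";    # THAI DIGITS SEVEN and EIGHT
my $num  = 0;

if ( $thai =~ /(\d+)/ ) {
    # Numification only understands ASCII digits, so $1 counts as 0 here.
    # The "isn't numeric" warning is silenced to keep the output clean.
    no warnings 'numeric';
    $num += $1;
}
print "sum: $num\n";
```

The match succeeds, yet the arithmetic silently treats the capture as zero unless warnings are on.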
Personally, I haven't used \d in many years. Not only does it match too much,
it will also match different characters depending on the Perl version. And
it matches "digits" that Perl doesn't know how to use in an arithmetic
expression.
I do use \w instead of [a-zA-Z_0-9], which is normally what I want to match -
but that's just me taking shortcuts; [a-zA-Z_0-9] is a bit long to type.
I would personally favour it if \d became just a shortcut for [0-9], and \w a
shortcut for [a-zA-Z_0-9]. All the time. Regardless of internal encoding
of the subject string, the locale, or how the regexp itself is encoded.
Not until then will I stop recommending that people not use \d or \w.
Abigail
Addendum.
C is already used. Argh!
So how about Z :-)
______________________________________________
This email has been scanned by Netintelligence
http://www.netintelligence.com/email
Hi all,
I would like to throw my two bits into the mix as a user, rather than a
developer of Perl.
I think if you want the Unicode semantics and really want to match all
Unicode digits you should be forced to write \p{Digit}.
\d should, as Abigail states, match only [0-9].
This should also hold for other shortcuts with pre-Unicode definitions.
This does not help out with locales, for which I propose a new set of
regex properties \C{Property Name}.
I picked 'C' as C tends to be the default locale.
So /\Cw/ or /\C{Word}/ would match a word character in the current locale.
John
> I would personally favour if \d becomes just a shortcut for [0-9], and \w a
> shortcut for [a-zA-Z_0-9]. All the time. Regardless of internal encoding
> of the subject string, the locale, or how the regexp itself is encoded.
>
I noticed you didn't touch \s, which is the one that troubles me (too?). I
often use \d and \w in patterns that are captured. It's good to match
tightly, so I agree with you. \s, on the other hand, matches parts of the
input I usually wish to discard. Having it behave laxly (i.e. match
characters such as NBSP) would benefit me.
- ELB
That’s a sore point for me. Even the fact that \s matches newline
often annoys me. I wish there was a shorthand for [ \t] which is
what I usually want when I use \s – though I often use \s anyway
for the brevity when it’s not a huge issue.
Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
+1
\s only does what I mean some of the time today. Adding extra meaning to
it just means more places that it doesn't quite work. A lot of code does
line-based parsing first (even if just using while (<>)), and then uses
\s liberally as an "any white space except for newline." If \s starts
matching other kinds of newlines, this code will be more broken than it
is today.
I've mostly given up on \s. I frequently use [ \t] instead as well.
For \d, I would expect that in every situation where /\A(\d+)\z/ matches,
(0+$1 == 0+$_) holds. I have not followed this thread very well - what does
Perl do for (0+$_) if it encounters a string with unicode numbers?
For \w, this is used very strictly in lexers/parsers. I frequently use
it to match exactly 'A' - 'Z', 'a' - 'z', '0' - '9', and '_' (ASCII)
before passing the argument to an external program. If it starts passing
additional characters through, I can think of several external
applications that *will* break, because they don't understand unicode.
For \s, it's a bit unusable at present, and changing the definition will
create more confusion and breakage than gain.
Changing \s, \w, and \d from their traditional meanings sounds dangerous.
My opinion.
What is the real gain here? That some applications will magically start
supporting additional unicode sequences and "just work"? That people can
type fewer regexp operands to get "new"-style behaviour? How many people
want this?
I suppose if it is "all non-English writers" my opinion might be
out-numbered. :-)
Cheers,
mark
--
Mark Mielke<ma...@mielke.cc>
As \s is currently [\h\v], perhaps you'd like "horizontal space" via \h:
U+0009 CHARACTER TABULATION
U+0020 SPACE
U+00A0 NO-BREAK SPACE
U+1680 OGHAM SPACE MARK
U+180E MONGOLIAN VOWEL SEPARATOR
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE
(Ogham? Mongolian? Hmmm.)
Which doesn't include "vertical space" via \v:
U+000A LINE FEED (LF)
U+000B LINE TABULATION
U+000C FORM FEED (FF)
U+000D CARRIAGE RETURN (CR)
U+0085 NEXT LINE (NEL)
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
--tom
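A quick way to check a few entries from the two lists, as a sketch assuming Perl 5.10 or later (where \h and \v exist). The four sample code points are picked from the lists above; the labels are only for printing.

```perl
use 5.010;
use strict;
use warnings;

# A few code points taken from the two lists above.
my %sample = (
    'U+0009 TAB'               => "\x{0009}",
    'U+3000 IDEOGRAPHIC SPACE' => "\x{3000}",
    'U+000A LINE FEED'         => "\x{000A}",
    'U+2028 LINE SEPARATOR'    => "\x{2028}",
);

my %class;
for my $name ( sort keys %sample ) {
    my $ch = $sample{$name};
    # Classify each character as horizontal, vertical, or neither.
    $class{$name} = $ch =~ /\A\h\z/ ? 'horizontal'
                  : $ch =~ /\A\v\z/ ? 'vertical'
                  :                   'neither';
    print "$name: $class{$name}\n";
}
```

Neither class depends on how the string is stored, which is what makes them handy replacements when \s is too vague.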
Yes, that works - except it's not in 5.6.0 which we still use. :-(
Gah... why are large companies always stuck in the past?
Thanks for the suggestion.
Can you explain why you want the locale to influence things?
You see, the locale handling logic is basically a template of what we
don't want, in the regex engine or in general. If I could get away
with it I would deprecate use locale, and all of the locale based
regops, as they are a major maintenance nightmare for IMO little benefit.
Once your text is stored as unicode you can define any properties you
wish. The regex engine/unicode infrastructure already has ways of
dealing with them and includes a comprehensive framework for dealing
with most everything we could want. What does locale give you? (Honest
question, I never use it, and have never seen a need to.)
\s troubles me less, because there isn't an equivalent issue to /1\d/ or
/\d+/. Furthermore, when one does want to separate "types" of whitespace,
one usually wants horizontal and vertical whitespace. For which we have
/\v/ and /\h/. I'd like to see /\s/ fixed in the sense that it matches
a fixed set of characters, regardless of encoding. I don't really care
whether it restricts itself to ASCII only, or whether "\x0b" is included
or not - after all, \h, \v and [\h\v] are already very useful.
>
> As \s is currently [\h\v]
It's not.
$ perl -E 'say "\x0b" =~ /\s/ ? "Yes" : "No"'
No
$ perl -E 'say "\x0b" =~ /\v/ ? "Yes" : "No"'
Yes
And while "\x85" always matches /\v/, it only matches /\s/ under
UTF-8 matching. Similar for "\xA0", which always matches /\h/, but
has a UTF-8 matching dependency on whether it matches /\s/.
See also "man perlrecharclass".
Abigail
It seems you want to match horizontal whitespace. I use \h when I need
that, that will match space, tab, the no-break space, and a handful of
Unicode spaces.
Abigail
While not short names,
in 5.11 [[:Blank:]] and \p{PosixBlank} should match just TAB and SPACE.
I'm in agreement with Mark. If you want Unicode semantics for characters
that can be used in identifiers then you have \p{ID_Start} and
\p{ID_Continue}.
\w should keep its historical meaning.
> U+180E MONGOLIAN VOWEL SEPARATOR
How about compile-time options that set the defaults for every combination of these, alongside the all-encompassing pragmata? That way Tatsuhiko Miyagawa would not have to put a line at the top of every Perl program he writes.
Let's map each pragma to a compile-time option so Tatsuhiko Miyagawa
can declare his preferences at install time.
my $string = 'garçon';
use locale 'fr';
print "Contains French exemplar characters" if $string =~ /^\w+$/;
use locale 'en';
print "Contains non-English exemplar characters" unless $string =~ /^\w+$/;
And then make it an IO filter as well. Then I could have 1000.00 be
rendered as 1,000.00 or 1.000,00 depending on the locale.
Careful: wouldn't historical meaning include
locales, wherein \w would also include (for example)
é and ç in French, ñ in Spanish, ß in German,
and ð and þ in Icelandic? And didn't we already
find that locale-shifting char classes made
life really hard on the regex engine (at least)?
I don't know whether this is harder on it than
it already suffers under the Unicode vs bytes
shifts in behavior, but both seem problematic
to an annoying degree.
This is why my test program was tricked into
thinking \s suddenly started matching VT like
\v does, despite decades of historical precedent.
I'd forced it into Unicode mode. :(
--tom
--
"Toss no fish to hysterical porpoises."
use locale is in some respects broken by qr//, as it doesn't use regex
flags and depends on the context it is compiled within.
So for instance, if you use locale and then have a sub return a qr//
compiled regex and then use that object alone in a match, then anywhere
you pass it, it will match using the semantics of the locale in effect
when it is matched. If the qr// is inserted in another pattern the
localeness of the pattern is destroyed.
In short qr// results compiled under use locale have different results
depending on how they are used. These regexes are also much slower
than ones not compiled under locale as they have to do a lot more run
time comparisons to check if they match.
> I don't know whether this is harder on it than
> it already suffers under the Unicode vs bytes
> shifts in behavior, but both seem problematic
> to an annoying degree.
Locale regexes are irritating because you can't precompute them. They
are defined to change based on your environment, which can change
between compilation and execution of the regex. So you delay a lot of
stuff that could be precomputed to inside of the regex matching loop.
> This is why my test program was tricked into
> thinking \s suddenly started matching VT like
> \v does, despite decades of historical precedent.
> I'd forced it into Unicode mode. :(
And this is why we really, really want \w and \s and \d to match the
traditional thing, even if this means requiring people to add something
to older scripts to support the legacy behaviour. You can't tell what a
pattern does by looking at it; you have to know the internal bit flags
of the string involved.
You can do it this way, not having to depend on whatever may be installed
on the system:
# User-defined property subs must have names starting with "In" or "Is".
sub IsFrench {return <<"--"} # Might not be the correct set of French chars.
41 5A
61 7A
C0 C2
C6 CA
CC CE
D2 D4
E0 E2
E6 EA
EC EE
F2 F4
--
say "Contains French exemplar characters" if $string =~ /^\p{IsFrench}+$/;
Abigail
Just to be sure: \b will continue to be defined based on \w and \W
and change its behavior as well, right? I'm only asking because \b is
not explicitly listed in this discussion.
Cheers,
-Jan
However if we could tweak 'locale' I'd like to be able to do
use locale 'fr';
and have \w match French accented characters as well.
It's not; actually you even need codepoints > FF.
FWIW I agree with simplifying \d and \w everywhere. Using more complex
forms to match more complex sets is good huffman-coding, and is good
code documentation too.
> You can do it this way, not having to depend on whatever may be
> installed on the system:
[...]
Yours seems a much better approach. Being bound to the whims of one's
current system's idea of correct locales, let alone the current user's
setting of the same, is too unreliable. I've seen many errors in system
locales. For example, some of these symlinks senselessly point to ASCII:
darwin% ls -l /usr/share/locale/nl*/LC_C*
lrwxr-xr-x [...] 29 Nov 7 2008 /usr/share/locale/nl_BE.ISO8859-1/LC_COLLATE@ -> ../la_LN.ISO8859-1/LC_COLLATE
lrwxr-xr-x [...] 27 Nov 7 2008 /usr/share/locale/nl_BE.ISO8859-1/LC_CTYPE@ -> ../la_LN.ISO8859-1/LC_CTYPE
lrwxr-xr-x [...] 30 Nov 7 2008 /usr/share/locale/nl_BE.ISO8859-15/LC_COLLATE@ -> ../la_LN.ISO8859-15/LC_COLLATE
lrwxr-xr-x [...] 28 Nov 7 2008 /usr/share/locale/nl_BE.ISO8859-15/LC_CTYPE@ -> ../la_LN.ISO8859-15/LC_CTYPE
lrwxr-xr-x [...] 28 Nov 7 2008 /usr/share/locale/nl_BE.UTF-8/LC_COLLATE@ -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x [...] 17 Nov 7 2008 /usr/share/locale/nl_BE.UTF-8/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
lrwxr-xr-x [...] 28 Nov 7 2008 /usr/share/locale/nl_BE/LC_COLLATE@ -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x [...] 17 Nov 7 2008 /usr/share/locale/nl_BE/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
lrwxr-xr-x [...] 29 Nov 7 2008 /usr/share/locale/nl_NL.ISO8859-1/LC_COLLATE@ -> ../la_LN.ISO8859-1/LC_COLLATE
lrwxr-xr-x [...] 27 Nov 7 2008 /usr/share/locale/nl_NL.ISO8859-1/LC_CTYPE@ -> ../la_LN.ISO8859-1/LC_CTYPE
lrwxr-xr-x [...] 30 Nov 7 2008 /usr/share/locale/nl_NL.ISO8859-15/LC_COLLATE@ -> ../la_LN.ISO8859-15/LC_COLLATE
lrwxr-xr-x [...] 28 Nov 7 2008 /usr/share/locale/nl_NL.ISO8859-15/LC_CTYPE@ -> ../la_LN.ISO8859-15/LC_CTYPE
lrwxr-xr-x [...] 28 Nov 7 2008 /usr/share/locale/nl_NL.UTF-8/LC_COLLATE@ -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x [...] 17 Nov 7 2008 /usr/share/locale/nl_NL.UTF-8/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
lrwxr-xr-x [...] 28 Nov 7 2008 /usr/share/locale/nl_NL/LC_COLLATE@ -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x [...] 17 Nov 7 2008 /usr/share/locale/nl_NL/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
darwin% ls -l /usr/share/locale/es*/LC_C*
-r--r--r-- [...] 2518 May 31 2008 /usr/share/locale/es_ES.ISO8859-1/LC_COLLATE
lrwxr-xr-x [...] 27 Nov 7 2008 /usr/share/locale/es_ES.ISO8859-1/LC_CTYPE@ -> ../la_LN.ISO8859-1/LC_CTYPE
-r--r--r-- [...] 2518 May 31 2008 /usr/share/locale/es_ES.ISO8859-15/LC_COLLATE
lrwxr-xr-x [...] 28 Nov 7 2008 /usr/share/locale/es_ES.ISO8859-15/LC_CTYPE@ -> ../la_LN.ISO8859-15/LC_CTYPE
lrwxr-xr-x [...] 28 Nov 7 2008 /usr/share/locale/es_ES.UTF-8/LC_COLLATE@ -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x [...] 17 Nov 7 2008 /usr/share/locale/es_ES.UTF-8/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
lrwxr-xr-x [...] 28 Nov 7 2008 /usr/share/locale/es_ES/LC_COLLATE@ -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x [...] 17 Nov 7 2008 /usr/share/locale/es_ES/LC_CTYPE@ -> ../UTF-8/LC_CTYPE
I believe that shows why you must not depend on system locales.
Ctype definitions apply to character classes, but these may not be
correct, as shown above. Unicode properties are more predictable,
and although they might not be what you want (eg, \p{IsDigit} vs
[0-9]), you know what they are. Right?
For collation, I advise ignoring locales altogether and using only
Unicode::Collate objects' sort method. You might wish to add arguments
for special treatment of say, ij in Dutch or ch in Spanish, but then
you know what you're getting, having spelt out your rules yourself.
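As a minimal sketch of the locale-free approach, using only the default Unicode::Collate ordering with no tailoring: the word list is made up for illustration, and any language-specific rules (such as the Dutch ij or Spanish ch mentioned above) would need extra constructor arguments that I am not showing here.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = ( "z", "B", "\x{E9}", "b", "a" );    # \x{E9} is e-acute

# Default UCA ordering, independent of any system locale.
my $collator = Unicode::Collate->new;
my @by_uca   = $collator->sort(@words);

# Plain sort compares code points, so "B" sorts first and e-acute last.
my @by_codepoint = sort @words;

# Print code points rather than raw characters to avoid encoding issues.
print "UCA:       ", join( ' ', map { sprintf 'U+%04X', ord } @by_uca ),       "\n";
print "codepoint: ", join( ' ', map { sprintf 'U+%04X', ord } @by_codepoint ), "\n";
```

The collator gives the same answer on every machine, whatever LC_COLLATE happens to point at.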
That's why I see explicitly listing characters you want to count as
being "in a language" the way you've done, Abigail, as more reliable.
Language rules lie outside of what Unicode addresses, and POSIX locales
cannot be counted on. Neither can user settings of the same.
--tom
PS: Contrary to widespread misunderstanding ASCII is insufficient even
for English, as it cannot be correctly expressed in ASCII alone.
Otherwise you're doomed to an absurd, telegraph-like reduction.
>>> \w should be its historical meaning
>> Careful: wouldn't historical meaning include
>> locales, wherein \w would also include (for example)
>> é and ç in French, ñ in Spanish, ß in German,
>> and ð and þ in Icelandic? And didn't we already
>> find that locale-shifting char classes made
>> life really hard on the regex engine (at least)?
> That[']s why in an earlier email I proposed \Z{w} to
> handle local[e]s so \w stays fixed.
Or broken? :)
> However if we could twe[a]k 'locale' I'd like to be able to do
> use local[e] 'fr';
> and have \w match [F]rench accented characters as well.
But this is not at all as easy as you may think!
It's asking a lot of the Perl core, so much that it seems
better served by something in some Lingua::FR:: namespace.
[ELABORATION FOLLOWS IN SOME DETAIL]
For example, French orthography requires not only
the three accent marks:
* the acute accent, for é
* the grave accent, for è, à, ù
* the circumflex accent, for any of the five vowels,
as in fenêtre, forêt, côte, sûr
But accent marks aren't enough, nor may those be arbitrarily
applied. Beyond accent marks, proper French orthography also
demands several other mandatory diacritics and digraphs, such as:
* the cedille to mark a soft c before [aou], as in
ça and français itself
* the di[a]eresis to mark a vowel in hiatus rather than in diphthong,
as in maïs, Haïti, Noël, capharnaüm; also occasionally after
g as in ambiguë, argüer, mangeüre; plus for a few imported words
* the digraph œ, mandatory in words like œuf, bœuf, chœur, œnologie
* the digraph æ, now needed (I believe) only for l'æschne in
current French, but perhaps also for Latin imports like
curriculum vitæ and et cætera
You'll also have to consider combining characters besides precomposed
ones. And you must be picky: just because you accept, for example,
the acute accent for é doesn't mean you accept á or ć; those are
illegal in native French words. They might each respectively occur in
Spanish or Polish, but so might many other things. Still, those might
well appear in an otherwise French text, so what do you do then?
Note this is current French only; historically other possibilities may
have occurred which should probably be considered.
And I've not even thought about what might be done with hyphenated words
or those with apostrophes. You can't leave them out; those are no more
multiple words than they'd be in English. Their hyphens and apostrophes
are required elements without being per se "letters". If you're going
to do \w "right", you really ought to think about such components, too.
I disavow any expertise in French, so matters may be otherwise
than as presented here. I doubt they're substantially simpler.
So you see, it's well beyond any mere matter of "accent marks".
While in actuality (French sense :) we might indeed have here
the knowledge needed for French (naming no names :), we surely
lack it for many other languages this would open the door for.
I don't mean there's nothing to what you said, John.
But I dread seeing such language-related rulesets in a core pragma within
the standard Perl distribution instead of off in some Lingua::*:: module on
CPAN, created and maintained by people with true expertise in each language.
--tom
In reading these comments all at once, I'm not sure we are all on the
same page as to the proposal, and what happens now. So, let me state
what I think both are; correct me if I'm wrong:
The way it works now:
With a 'use locale' or on an EBCDIC platform:
    they match whatever the C language ctype routines say they match:
    isdigit() for \d, isspace() for \s, and isalnum() for \w (I know \w
    also adds underscore, but I didn't see where it does that in a quick
    scan of the code).
Absent a 'use locale' and not on an EBCDIC platform:
    if (the string being matched against doesn't have the utf8 flag on
        && the regular expression doesn't contain something that would
           make it look like it should behave with utf8 semantics; any \p{}
           in it, for example, will force it into utf8)
    {
        \d = [0-9]; \w = [_a-zA-Z\d]; \s = [ \t\f\r\n]
    } else {
        they match what Unicode says, except that there are some bugs so that
        \w matches too much, like fractions.
    }
What I meant to say was the proposal:
No change to 'use locale' or EBCDIC. Even if we could deprecate 'use
locale', we would be stuck with supporting it in 5.12, I think.
Otherwise, \d = [0-9]; \w = [_a-zA-Z\d]; \s = [ \t\f\r\n]
regardless.
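To make the storage dependence concrete, here is a small sketch. It assumes no 'use locale', a non-EBCDIC platform, and no pragma that switches on Unicode string semantics; under those assumptions I would expect the two \s results to differ, but what you actually see depends on the Perl version, so the script just prints both.

```perl
use strict;
use warnings;

my $nbsp = "\xA0";    # NO-BREAK SPACE, stored as a single byte

# Match before and after changing only the internal storage.
my $before = $nbsp =~ /\s/ ? 1 : 0;
utf8::upgrade($nbsp);    # same character, now stored as utf8
my $after = $nbsp =~ /\s/ ? 1 : 0;

# The explicit class gives one answer regardless of storage.
my $explicit = $nbsp =~ /[ \t\f\r\n]/ ? 1 : 0;

print "\\s before upgrade: $before\n";
print "\\s after upgrade:  $after\n";
print "explicit class:    $explicit\n";
```

Under the proposal, \s would behave like the explicit class on the last line in both cases.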
Yes, I am aware of this. The BOUND regops handler (and relatives) need
to be fixed.
It will probably be addressed along with /\s/ and stuff.
cheers
With the caveat that I am replying prior to finishing my first cup of
coffee of the day I think this looks right.
Just FYI, the above ought to work pretty darn well *except* that it
isn't skipping over the contents of (?#), (?{}), (??{}) blocks where
the characters \w might be otherwise meaningful. I'm sure something
like http://search.cpan.org/dist/Regexp-Parser does the right thing.
Josh
20:50 <@TimToady> it's terribly bad Huffman coding to restrict \w to ascii
20:50 <@TimToady> obra__: see ^^
20:50 <@TimToady> please kill that idea
20:51 <@TimToady> Perl 5 must not revert to ASCII semantics where we've been
gaining ground on Unicode for many years.
So. Now we know where we're aiming. How good are our tests? ;)
That's a pity.
But more importantly, did Larry express any sentiment regarding \d?
Abigail
Now he has:
22:05 <obra_> TimToady: ping? Abigail asks if the same ruling applies to \d (and presumably also \s)
22:06 <@TimToady> yes, it does
> Abigail
>
--
Probably sacrilegious - but I think this is a poor decision, and don't
see how it is Rule 1 material. Do we see Larry around here any more? I
do not see posts, and Rule 1 via proxy with cut + paste from IRC is like
dumping a girl friend via a text message. If Larry really cared, he
should post something eloquent on this mailing list - not a one line
summary that says "we've come this far already - no turning back now."
Then again - my usage of Perl has been on a steady decline, and I've
seen the same trend elsewhere. By the time Perl perfectly supports
Unicode, after so many imperfection evolutions, I predict Perl 5 will
not be relevant nor a target programming language for applications that
require full Unicode support. There are so many alternatives now-a-days,
that do at least as good a job, that it is inevitable for this to occur.
Perl 5's main staying power in my opinion was its portability - but
major changes in behaviour across releases seriously degrades
portability, and the widespread use of XS in CPAN modules also works
against portability. Choosing to use Perl as the best-in-choice language
for an application in 2009 is becoming a lot more difficult.
Feel free to begin throwing rotting fruit at me and/or banning me from
the list.
Seems like the distinction between matching a character that is a digit vs.
matching ascii digits is mostly about what you do with the numbers
afterwards. Perhaps it's better to just remove the extra duplication?
I am *extremely* upset at this.
Not only was your question not what we propose to do, but the fact
that you didn't discuss it with me, or have this discussion with me
makes me extremely unhappy.
We have bugs. Bugs can't be resolved by appeal to Larry.
And since Larry isn't doing the work, and likely *I* will be, then I
reckon I should have been included in the discussion.
I'm really tempted to say find yourself another regex engine hacker.
cheers,
And actually, I simply fundamentally reject this as a rule 1 ruling.
If you want to get a rule 1 decision on this that I will accept plan
to have me in the discussion.
Communities should function on merit, not on laurels. In terms of merit,
I vote to accept a decree from the most active and influential members
of the community with a history of recent successes - not from a past
active and influential member with past successes. I consider Yves to be
high on the list deserving of title based upon merit.
Free software is not "owned" by any single party. If Perl is free, it
cannot have a single person or company having sole veto power of its
future. Rule 1 and 2 need to be updated or removed entirely. In Canada,
we still have "on the books" that the Queen of England can veto any
decisions our government makes - but if the Queen ever used this
authority, disrespecting the choice of our nation, we would quickly
assert our independence and remove this power. The situation here is
similar - I cannot remember the last time Larry Wall posted to this
newsgroup or submitted a patch.
This community needs an injection of nurture and good will. If Larry is
reading this - I mean nothing personal. Perl was great for me for many
years, and I am glad you wrote it instead of whining about *sh/awk as
others did. But that was in the early '90s and before. It is now over
20 years later. The torch for Perl 5 has passed on to other people, as
it should. I think some respect for these new torch bearers is deserved.
The laurels should be passed on to those who maintain an active and
influential reputation based on merit. The veto power should be banished
or updated to reflect the current state of the community.
I don't accept that any person can be right even when they are wrong. I
reserve this sort of faith for my God and none other. Perl is a
programming language. It is a tool to fulfill a purpose. It is not divine.
> Free software is not "owned" by any single party. If Perl is free, it
> cannot have a single person or company having sole veto power of its
> future. Rule 1 and 2 need to be updated or removed entirely. In Canada,
That's fine with me.
You're as free as anyone else to fork http://perl5.git.perl.org/perl.git
within the terms of the licence, and go forth and build up a userbase.
Nicholas Clark
Hi, while I am sympathetic to much of what you wrote here I do have a
different core opinion.
Every healthy community ends up with checks and balances to keep the
things flowing, in particular there is usually someone that serves the
role of chief justice. I view Larry's role as pretty much chief
justice. As such I do accept his right to make final decisions
regardless as to his patch/post rate over the past while. Perl is his
and I absolutely do accept his right to make a final ruling.
What I do not accept, and I believe I have at least some support for
this in the community, is that a Rule 1 hearing has even happened.
If there is going to be a Rule 1 hearing then IMO the question needs
to be clear and the relevant parties able to present the facts. This
hasn't happened ergo there is no ruling.
Additionally my view is that even if I were to concede that such a
hearing has happened, and I do not, the question and answer are in my
opinion not relevant to the plans we have for \w \s and \d as we don't
plan to do what the question asked, and thus the answer is irrelevant.
Although it does imply that at least one facet, that of default behaviour,
is resolved. Although I view this as non-controversial, as a careful
reading of my most recent posts say pretty much the same thing, with
the possible exception of \d, which is a subject I still consider open
on security grounds alone.
For the right to be recognized means for the community policies to
acknowledge both the right, and the cost of a fork.
Above, you acknowledge the right, but not the cost, nor did you respond
to the question.
In terms of cost - I've seen the quick response of "go ahead and make
your own fork, and build up your own userbase" given in numerous forms
to numerous people with concerns - but there is a problem with this
statement. The word "your" suggests an emphasis on the investment
required to set up a fork in an effort to discourage or discredit the
concern. "Unless *YOU* are willing to invest in the effort to make a
successful fork, *YOUR* opinion is not worth considering." A community
should always be "our" - not "my" or "your". I think any argument that
relies on the inability of the challenger to personally invest in a
complete competitive replacement for the community is weak. So weak, in
fact, that forks are pretty common, especially for free / open source
projects. Many community leaders have found their entire community to
leave as a direct result of their failure to collaborate and compromise,
resting on the assumption that a fork is too much effort to occur.
In terms of the question:
Rule 1: Larry is always by definition right about how Perl should
behave. This means he has final veto power on the core functionality.
Yves: I simply fundamentally reject this as a rule 1 ruling.
Does the community vote to adhere to Rule 1 and enforce the ruling,
including putting in the effort to make the changes as required by Rule
1? Or does the community vote to waive Rule 1 in this case for Yves,
leading to a precedent of treating it as optional policy in the future?
Either Rule 1 will be enforced or it won't be. I think it shouldn't be,
and I think it won't be. I think the right to reject rule 1 has been put
on the table, and that the community should support Yves. Note that
right or wrong is not a factor here - the real complaint from Yves is:
> And since Larry isn't doing the work, and likely *I* will be, then I
> reckon I should have been included in the discussion.
This is a legitimate complaint. The right to be included in the
discussion, though, rejects rule 1. Rule 1 is pretty arrogant.
What do you think?
You may not have meant anything by what you responded - perhaps you even
meant to be benevolent. :-) You've walked into my rant, and if you find
my response offensive or outrageous, I apologize in advance. :-) It is
meant to be thoughtful and reflective.
> Does the community vote to adhere to Rule 1 and enforce the ruling,
> What do you think?
The community can vote all it frigging likes.
What matters is who contributes code, and whose thoughts influence those who
contribute code.
Anyone is free to have an opinion. Don't get me wrong on that.
But the conversion of opinions to actions determines what happens.
I've seen someone *complain* that perl5-porters is a meritocracy, because
that means that they're ignored. (Forget whom, and it was somewhere on IRC)
(Ignore for the moment whether it is or isn't actually functioning as a
meritocracy).
Which struck me as naive on the part of the complainer, because implicit
in the complaint was a rejection of "he who pays the piper calls the tune".
The community can do what the hell it likes. But if it doesn't cause people to
1: answer bug reports about the perl core code
2: locate the causes of bugs in the perl code
3: fix those bugs
4: contribute improvements to the perl core code
then the community, or those parts of it uninvolved in the above is
irrelevant here.
"cause" can be contribute time, contribute code, contribute funding to any
entity capable of converting money into the previous two.
Nicholas Clark
Haha. I think you believe in rule 2, then. :-)
Yes - we are different on this. I think chief justice as you call it
should be an elected term position, and should require certain
responsibilities which include a minimum level of activity in the
community. I don't believe in benevolent dictators - and especially not
absent or casual benevolent dictators.
I think the community should have the power to shape its own rules, and
the community should evaluate these rules periodically to ensure they
are providing the most value.
Anyways - unless other people feel the same, my opinions might be that
of a secluded community of one. In any case - I don't care what happens,
but I do want to ensure that you are respected, and that you are
influential in any discussion related to the substantial work you have
contributed, which specifically includes regex and unicode. You deserve
a hearing - and if it came right down to it, I would accept your call
over an absent benevolent dictator, even if it disagreed with my own
position. Thank you for your contributions.
I'll leave your mailboxes clear for a while again without my drivel.
Have a good week.
On Wed, Oct 28, 2009 at 09:52:42AM +0100, demerphq wrote:
> 2009/10/28 jesse <je...@fsck.com>:
> > Tonight on #perl6, Larry made a fairly definitive statement about \w
> > matching and unicode:
> >
> > 20:50 <@TimToady> it's terribly bad Huffman coding to restrict \w to ascii
> > 20:50 <@TimToady> obra__: see ^^
> > 20:50 <@TimToady> please kill that idea
> > 20:51 <@TimToady> Perl 5 must not revert to ASCII semantics where we've been
> > gaining ground on Unicode for many years.
> >
> > So. Now we know where we're aiming. How good are our tests? ;)
> >
>
> I am *extremely* upset at this.
>
> Not only was your question not what we propose to do, but the fact
> that you didn't discuss it with me, or have this discussion with me
> makes me extremely unhappy.
Just for the record, I did not ask Larry a question about the regex
semantics or ask Larry to make a ruling. Larry made a statement on #perl6
and asked me to convey it to perl5-porters. I'd thought that any
changes to the regex engine in this area were currently on-hold as
"insanely complicated and need a bunch of work to get sorted out."
There _were_ a number of conflicting ideas about what \w and friends
should match by default. Some smarter and some crazier. In general, I had
understood us to be in good (no worse than 5.8) shape with regard to \w,
\d, and \s at the moment.
Larry doesn't say "must not" about Perl 5 often. When he does, it's
certainly newsworthy.
I promise that I wasn't trying to piss you off. Even to get back at
you for a certain two and a half hour discussion on history editing the
other day ;)
Best,
Jesse
How about some "use ascii;" / "use re 'ascii';"?
--
Ruud
The original plan WAS overreaching and a bad idea and mea-culpa.
The refined plan was to make it configurable, with the exception of
\d, which I and many believe should default to ascii semantics as
there are very few applications where \d matching anything else is the
right thing to do.
If Larry really believes that \d matching Thai digits, superscripts,
subscripts, and other bizarre things is the right Huffman encoding for
\d, then I want him to say it on list himself so that when I close
tickets related to the subject I can point at his email. In particular
right now \d matches the following codepoint ranges:
0030 0039
0660 0669
06F0 06F9
07C0 07C9
0966 096F
09E6 09EF
0A66 0A6F
0AE6 0AEF
0B66 0B6F
0BE6 0BEF
0C66 0C6F
0CE6 0CEF
0D66 0D6F
0E50 0E59
0ED0 0ED9
0F20 0F29
1040 1049
1090 1099
17E0 17E9
1810 1819
1946 194F
19D0 19D9
1B50 1B59
1BB0 1BB9
1C40 1C49
1C50 1C59
A620 A629
A8D0 A8D9
A900 A909
AA50 AA59
FF10 FF19
104A0 104A9
1D7CE 1D7FF
This list has changed a couple of times as I recall, and no doubt will
again in some future version of unicode. So is that really right? For
\w the case is arguable either way so I don't object to making it match
unicode, but for \d? Do we really want to force every person matching
url parameters for an id to use [0-9] instead? This has time and again
been remarked upon as a bad call and a bug in waiting, to the extent
that many people avoid \d as being altogether too risky.
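To make the id case concrete, a small sketch follows. The id values and the `accepts` helper are made up for illustration; the only assumption is that the parameter has already been decoded into a character string.

```perl
use strict;
use warnings;

# Hypothetical id values as they might arrive from a decoded URL parameter.
my $ascii_id  = "1234";
my $arabic_id = "12\x{0663}4";    # U+0663 ARABIC-INDIC DIGIT THREE

# Returns 1 if the pattern accepts the value, 0 otherwise.
sub accepts {
    my ($re, $value) = @_;
    return $value =~ $re ? 1 : 0;
}

my $loose      = qr/^\d+\z/;       # any Unicode decimal digit on such a string
my $ascii_only = qr/^[0-9]+\z/;    # ASCII digits only

printf "%-6s loose=%d ascii_only=%d\n", 'ascii',
    accepts($loose, $ascii_id), accepts($ascii_only, $ascii_id);
printf "%-6s loose=%d ascii_only=%d\n", 'arabic',
    accepts($loose, $arabic_id), accepts($ascii_only, $arabic_id);
```

The \d version lets the second value through, while [0-9] rejects it, which is why people end up spelling out the character class by hand.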
Your quote of a less than 60 second response to this question is not
sufficient in my book. And I think I've done enough work on Perl to
deserve such a mail.
> I promise that I wasn't trying to piss you off. Even to get back at
> you for a certain two and a half hour discussion on history editing the
> other day ;)
I understand. However I do think that if Larry wants to invoke Rule 1
he has to do it on list personally and address the concerns involved.
I don't really think that is an unreasonable expectation, at least in
this case.
As for most of the potential changes to \w and \s, I have not much opinion. In
all my code that expects Unicode, I have been careful, and I hope others have,
too.
As for \d, though, I am horrified to think how much bad behavior could be
introduced if \d started to match TITLE CASE KLINGON NUMERAL CHORGH
I think it is likely that I would not upgrade to a perl5 that introduced such
behavior. "Review every regex that uses \d" is not an acceptable burden.
--
rjbs
What you just described is the present situation. And many people have
this bug and have done exactly what you said.
If unicode adds that codepoint, and gives it the property IsDigit then
it will start to match in some version of Perl in at least some
situations. The question is which situations those should be.
I stand both corrected and astonished.
--
rjbs
For \w to have Unicode semantics, and \d not, is IMO, worse than it's
now. By all means, whatever you do, make \w, \d, and \s either all match
ASCII only, or let \w, \d, \s be short hands for \p{IsWord}, \p{IsDigit}
and \p{IsSpacePerl}. But don't mix and match.
Abigail
Just for the record, Unicode has resisted so far the attempts to
formalize Klingon. But, it is available in an unofficial Private Use
area, partitioned and registered at http://www.evertype.com/standards/csur/
U+F8F0 KLINGON DIGIT ZERO
U+F8F1 KLINGON DIGIT ONE
U+F8F2 KLINGON DIGIT TWO
U+F8F3 KLINGON DIGIT THREE
U+F8F4 KLINGON DIGIT FOUR
U+F8F5 KLINGON DIGIT FIVE
U+F8F6 KLINGON DIGIT SIX
U+F8F7 KLINGON DIGIT SEVEN
U+F8F8 KLINGON DIGIT EIGHT
U+F8F9 KLINGON DIGIT NINE
Coincidentally, a linguist friend of mine told me this week that Klingon
is the 2nd most widely spoken made-up language in the world, after
Esperanto. So, Unicode may encode them. They already did encode GB
Shaw's alphabet, a Mormon alphabet, and there are serious proposals to
encode JRR Tolkien's alphabets.
So is anyone working on making perl's atof function handle all these
additional code points? What's the unicode for NaN anyway?
--
warlorded myself
I'm working to expose these other properties.
will
say 0+'𒑢'
give 0.25?
That would be cool.
I agree with Yves. This kind of decision needs a better airing than we
have received. It's not clear from this excerpt if Larry is aware of
the discussions and issues that have gone on in this forum. For
example, is he aware that this was to be configurable?
--
Yes I agree. I think it is best to let this thread die.
I will write up a new summary of what I believe to be a sane plan, and
then we can see if there really is anything to argue about.
As you say here, I believe that the only area of controversy *at this
point* is about \d, and I think that we can sort it out as a community
without resorting to rule 1 intervention. :-)
I carry a great deal of responsibility for what controversy there is
as I vastly underestimated the impact that changing the default
behaviour *at all* would cause, and it *is* reasonable that people
thought that original plan was a bad thing. I am sorry for any
community heartache that has caused.
Furthermore, taint washing is carried out by regexes, and extending the
semantics of \w \d and \s could allow tainted data to be cleaned where
it should not.
If you think there are a lot of scripts out there that use do (LIST),
that will pale into insignificance next to those scripts that assume \w ≡
[a-zA-Z0-9_]
Oh and from http://perldoc.perl.org/perlre.html
\w Match a "word" character (alphanumeric plus "_")
Now I don't see alphanumeric defined anywhere but I also don't see how
it can be forced to match 灞
John
Wrong tense.
The status quo, since 5.8 (with complicated status in 5.6) is that \w and
\s are unusably broken, but attempt to match Unicode character classes.
\d also attempts to match a Unicode character class, and afaik does
so successfully.
-zefram
> Now I don't see alphanumeric defined anywhere but I also don't see how it
> can be forced to match 灞
>
It already does
$ perl -v
This is perl, v5.8.8 built for i486-linux-gnu-thread-multi
...
$ perl -le'print chr(28766) =~ /^\w\z/ || 0'
1
> Furthermore, taint washing is carried out by regexes, and extending the
> semantics of \w \d and \s could allow tainted data to be cleaned where it
> should not.
>
If you're using \w to filter out Chinese characters, you're already failing.
What do you think extending \w \d and \s will do.
>
There's been no discussion of expanding them. The problem is that what they
match varies depending on Perl internals
$ perl -le'
$s1 = "\xC2";
$s2 = "\x{2660}";
for ($s1, $s2, $s1.$s2) {
print /\w/ || 0;
}
'
0
0
1
If there's no \w in s1 or in s2, why does their concatenation have one?
The fundamental question is, "what do you use \w for"? And is this usage
correct?
I seldom use \w unless I know the text I match against is ASCII only.
And when I do use it against non-ASCII text, I know I'm creating technical
debt.
\d, I never use anymore. And I spend a lot of time explaining to people
that their use of \d in a regexp is actually wrong.
Abigail
Yesterday I read in perl5110delta that \s, \w and \d would change to
ascii-only. I thought this was a bad idea for several distinct reasons,
and briefly discussed the change on IRC and Twitter, but not
perl5-porters, primarily because a discussion on this list often goes on
for a while. (To convince a group of highly knowledgeable people is
incredibly energy consuming; I'm not doing that anymore.)
Today I found perl5111delta; somehow I had failed to notice that the
changes were already reverted.
By the way, I asked in #perl6 about the direction for Perl 6, not
knowing that Larry would feel strongly about it, and not knowing he'd
invoke Rule 1. However, I'm glad that he did.
Now with "Rule 1" invoked and the changes already reverted, I feel
confident enough that I can post my thoughts here without the pressure
of having to convince anyone, or thinking I should.
For me, it's not just about embracing Unicode.
Ignoring Perl 5.11.0, there's a clear bug in Unicode capable Perl 5s
regarding string semantics. It took a while to reach consensus that this
is indeed a design bug, a broken abstraction: the semantics of several
operations are dependent on the internal representation of otherwise
indistinguishable values.
Specifically, several operations take only codepoints in the ASCII range
into account when the internal encoding of the operand string is not
UTF-8. This applies to built-in character classes and their shortcuts
and for functionality that deals with letter case (uc, lc, /i).
The historical behavior was to match only ASCII, but Unicode support was
added later. Most people remained ignorant about this change and its
implications. But anyone who is aware of the added Unicode support is
bitten by the hard to predict distinction between two sets of semantics.
Now, it's fairly simple to force Perl to use only the Unicode semantics.
Just utf8::upgrade the string and these operators and the regex engine
will behave predictably. And that's what I told people to do. Upgrade
your strings until Perl is fixed. I've told IRC, I've told YAPC and
workshop audiences, Perl Monks, and readers of the now-official
perlunifaq. I've heard others echo the advice and I've seen lots of
utf8::upgrade's in the wild.
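A minimal sketch of that workaround for the case functions, assuming no `use locale` and no `use feature 'unicode_strings'` is in effect (variable names are illustrative only):

```perl
use strict;
use warnings;

my $native = "\xC9";        # LATIN CAPITAL LETTER E WITH ACUTE, one byte

my $upgraded = $native;
utf8::upgrade($upgraded);   # same characters, UTF-8 internally

# Without the upgrade, lc only knows about ASCII letters.
my $lc_native   = lc $native;
my $lc_upgraded = lc $upgraded;

printf "native:   %s\n", $lc_native   eq "\xE9" ? "lowercased" : "unchanged";
printf "upgraded: %s\n", $lc_upgraded eq "\xE9" ? "lowercased" : "unchanged";
```

The upgrade changes nothing about the string's value, only which set of rules lc, uc and friends apply to it.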
Throughout the years I have always assumed that Perl would be fixed by
abandoning the ASCII-only behaviour, embracing Unicode as the default as
this had been the direction ever since 5.6. This assumption is
reflected in much of my writings and several talks on the subject.
It became painfully clear that a fix could never be made fully backward
compatible (except if the fix was enabled conditionally), but the nice
thing about the utf8::upgrade workaround is that you can repair your
existing code in a way that will continue to work even after the bug is
fixed. You can add the workaround to your code now, upgrade to 5.12
later, and then wait a few years before removing the calls to
utf8::upgrade or just leave them there. Even if utf8::upgrade were ever
removed from Perl, it'd be trivial to make it a no-op. It is safe to
add utf8::upgrade and be fairly certain that your code will continue to
function as it does today. (Modulo only some property changes in the
Unicode spec; this causing real problems is very rare.)
That is, until 5.11.0 introduced an intentional regression. While ASCII-only
has worked well in the past, and may in specific circumstances even make
more sense in terms of performance and security, I've called going back
to this an insanely bad idea. Larry agreed, noting that it strays
from the path of gaining on Unicode and that it is poor Huffman coding.
But as I said, to me it is not just about embracing Unicode. It's also
about compatibility. I agree that this is one of the incredibly rare
occasions where it's acceptable and maybe even necessary to break
backward compatibility. Going to Unicode-only means that the breakage
can be controlled by programmers. They can add utf8::upgrade statements
before upgrading perl, to make their code forward compatible. It
provides a way to prepare for the hefty change, and many have already
gone through their codebases looking for places to add this workaround.
Going (back) to ASCII-only, however, would not provide such a clean
upgrade path. There is no way to make your 5.10 code forward compatible
with 5.11.0, regarding \[dws], because there is no way to disable
matching out-of-ASCII-range characters in Perl 5.10. So the only way to
ensure a clean upgrade is to go through your code removing all uses of
\d, \w, and \s that could possibly match Unicode characters. (Which
would be extra painful for everyone who had already gone through it to
add utf8::upgrade calls!)
(Granted, there is a way that forces ASCII-only semantics but it breaks
the whole flow of "receive, decode, process, encode, send": just encode
the string to UTF8 temporarily, do your match, and decode again.)
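Roughly, the encode/match/decode detour looks like the sketch below. It uses the core Encode module; the sample text is made up, and it assumes the default matching rules (no locale, no charset modifier).

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "7\x{0663}";    # ASCII 7 followed by ARABIC-INDIC DIGIT THREE

# Matching the character string: both characters count as \d.
my $char_hits = () = $text =~ /\d/g;

# Encode to UTF-8 octets. The non-ASCII digit becomes two bytes >= 0x80,
# and on a byte string \d only sees the ASCII digit.
my $octets    = encode('UTF-8', $text);
my $byte_hits = () = $octets =~ /\d/g;

# Decode again to get back to character semantics for further processing.
my $round_trip = decode('UTF-8', $octets);

print "character string: $char_hits digit(s), octets: $byte_hits digit(s)\n";
```

It works, but every match has to be wrapped this way, and captures come back as octets that need decoding too.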
So I'm glad that 5.11.1 comes with renewed sanity, and I hope that
Larry's Rule 1 invocation will prevent the ASCII-only thing from
happening again. Perl 5.12 string semantics must be the same as Perl
5.10 semantics on utf8::upgrade'd strings; everything else should only
happen if explicitly requested by the programmer, preferably as
lexically local as possible.
In the past I have suggested adding a /a flag to the regular
expression engine. (Blissfully unaware of how hard this would be to
implement.) It would be useful for those cases (mostly sysadmin work)
where you want to match only ASCII characters. It'd have to be a flag
instead of a pragma, so it survives in qr, and so it can be negated in
a subregex. I still believe that such a flag could be useful. (But I
absolutely do not insist on having it.)
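For reference, later perls (5.14 and up) did grow an /a modifier along these lines. The sketch below assumes such a perl; the sample strings are made up.

```perl
use strict;
use warnings;

my $mixed = "id=4\x{0664}";     # ASCII 4 followed by ARABIC-INDIC DIGIT FOUR

my $any_digits   = qr/\d+/;     # default rules
my $ascii_digits = qr/\d+/a;    # /a restricts \d to [0-9]

# The flag travels with the qr object when it is interpolated.
my ($loose) = $mixed =~ /id=($any_digits)/;
my ($tight) = $mixed =~ /id=($ascii_digits)/;

# Inline form: only the first \d is ASCII-restricted.
my $inline_ok  = "4\x{0664}" =~ /^(?a:\d)\d\z/ ? 1 : 0;
my $inline_rev = "\x{0664}4" =~ /^(?a:\d)\d\z/ ? 1 : 0;

printf "loose captured %d char(s), tight captured %d char(s)\n",
    length $loose, length $tight;
print "inline: $inline_ok, reversed: $inline_rev\n";
```

Because it is a flag rather than a pragma, the restriction survives in qr objects and can be scoped to part of a pattern.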
--
Met vriendelijke groet, Kind regards, Korajn salutojn,
Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sa...@convolution.nl>
> I think chief justice as you call it
> should be an elected term position
Democracy has nothing to do with this, and we must make sure that it
never will. Consensus between involved people, and pro-activity; all
else needs mostly to be ignored.
--
Ruud (not involved much, so ignore)
> How difficult would it be to introduce special chars which aren't
> charclasses, which are probably more suitable for what people want anyway
> (things that agree with grok_number, with rules for natural numbers,
> integers, decimal fractions, and floating point notation)?
>
> Seems like the distinction between matching a character that is a digit vs.
> matching ascii digits is mostly about what you do with the numbers
> afterwards. Perhaps it's better to just remove the extra duplication?
Vim uses foo vs \_foo to distinguish whether a linefeed is included or
not; e.g.
abc. <= literal followed by anything except linefeed
abc\. <= literal followed by anything including linefeed
Maybe we can find some suitable mangling to apply to \w, \d, \s, etc...
to say "with extra Unicode chars like these"
--
Paul "LeoNerd" Evans
leo...@leonerd.org.uk
ICQ# 4135350 | Registered Linux# 179460
http://www.leonerd.org.uk/
It already matches non-ascii.
And no, it doesn't work. And frankly the idea of making it work doesn't
make a lot of sense to me.
You really want "\x{0E51}\x{0ED1}" to be another way to write "11"?
They aren't even in the same script.
\begin{not-really-serious}
I suggest \ḋ, \ṡ, and \ẇ for the Unicode character classes, and
\d, \s, \w for the ASCII versions.
For those not able to read my suggestions, it's
\N{LATIN SMALL LETTER D WITH DOT ABOVE} \x{1E0B}
\N{LATIN SMALL LETTER S WITH DOT ABOVE} \x{1E61}
\N{LATIN SMALL LETTER W WITH DOT ABOVE} \x{1E87}
\end{not-really-serious}
Abigail
> And no, it doesn't work. And frankly the idea of making it work doesn't
> make a lot of sense to me.
>
> You really want "\x{0E51}\x{0ED1}" to be another way to write "11"?
>
> They aren't even in the same script.
No, I don't. That was meant to be an appeal to absurdity to suggest
"don't do this" :)
> It already matches non-ascii.
Ah. Then that's unfortunate, as now we can't use $1 numerically after
capturing it with (\d+), and know it'll work. This is what I was getting
at..
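A small sketch of that trap, with a hypothetical helper name and made-up inputs:

```perl
use strict;
use warnings;

# Returns the numeric value Perl assigns to whatever (\d+) captured,
# or undef if the input is not all digits according to \d.
sub captured_number {
    my ($input) = @_;
    return unless $input =~ /^(\d+)\z/;
    no warnings 'numeric';    # the non-ASCII case would warn "isn't numeric"
    return 0 + $1;
}

my $ascii  = captured_number("34");
my $arabic = captured_number("\x{0663}\x{0664}");   # ARABIC-INDIC THREE, FOUR

print "ascii: $ascii, arabic-indic: $arabic\n";
```

Both inputs pass the \d+ check, but only the ASCII one numifies to the number a reader would expect; the other quietly becomes zero.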
> On Wed, 28 Oct 2009 09:25:30 +0200
> Yuval Kogman <nothi...@woobling.org> wrote:
>
> > How difficult would it be to introduce special chars which aren't
> > charclasses, which are probably more suitable for what people want anyway
> > (things that agree with grok_number, with rules for natural numbers,
> > integers, decimal fractions, and floating point notation)?
> >
> > Seems like the distinction between matching a character that is a digit vs.
> > matching ascii digits is mostly about what you do with the numbers
> > afterwards. Perhaps it's better to just remove the extra duplication?
>
> Vim uses foo vs \_foo to distinguish whether a linefeed is included or
> not; e.g.
>
> abc. <= literal followed by anything except linefeed
> abc\. <= literal followed by anything including linefeed
Sorry; I meant
abc\_.
Well, one of the reasons for \w to match more than ASCII characters
(first with locale, later with Unicode) was that it should be possible
to process 'words' in foreign scripts as well.
If we want '\w+' to be able to match Klingon "words", why shouldn't \d+
match Klingon numbers? Yes, \d+ matches digits from different scripts,
but \w+ matches word characters from different scripts as well.
Now, don't consider this an argument in favour of having \w match non-ASCII
characters - but, IMO, if \w can match non-ASCII characters, so should \d.
Abigail
> Now, don't consider this an argument in favour of having \w match non-ASCII
> characters - but, IMO, if \w can match non-ASCII characters, so should \d.
This would seem to make the most sense, and be the most predictable.
Either all of them match Unicode, or none of them do.
If none of them do, then adding Unicode variations might be a nice idea.
I would suggest
                   word   digit   space
ASCII-only         \w     \d      \s
Includes Unicode   \Uw    \Ud     \Us
Only \U is already used. And \u.
Do we have a definitive list anywhere, on a tangential note, of the
remaining unused \x letters?
> Now, don't consider this an argument in favour of having \w match non-ASCII
> characters - but, IMO, if \w can match non-ASCII characters, so should \d.
The constraint that anything that matches /\d+/ should numify to the
described number is a reasonable expectation. The alternative to
disallowing [^0-9] from \d is expanding numification to include
alternatives. Creating a utof function that knows the values of all
the digitty characters from all scripts would require two steps
besides compiling the list. The first step is a big discussion to
decide how to handle the edge cases, including but not limited to
expressions from mixed scripts, expressions from non-base-ten languages
(why I used cuneiform yesterday), expressions mixing scripts that mix
conventions, and ambiguous expressions.
Should attempting to numify Ⅵ0 produce six or sixty or zero and a
warning or throw an exception but only under a new pragma and if so
how should that pragma be enabled, either via strict or autodie?
Should Ⅵ be expressible as ⅤⅠ?
If the base-ten atoi algorithm is simply applied based on Unicode
numeric values, for instance, Ⅵ would be six but ⅤⅠ would be
fifty-one.
The second step, possible to do simultaneously with the ongoing first
step (the continuing proceedings of the working group on unicode to
numeric conversions in Perl) is implementing the decisions.
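For what it is worth, later perls (5.14 onward, as far as I know) ship a helper in this spirit: `num` in the core Unicode::UCD module. The sketch below assumes such a perl and only shows how it settles a few of the edge cases; it is not a utof implementation.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# Decimal digits from a single script convert as a base-ten number.
my $arabic = num("\x{0663}\x{0664}");   # ARABIC-INDIC DIGITS THREE, FOUR

# Digits from two different scripts are refused rather than guessed at.
my $mixed = num("\x{0E51}\x{0ED1}");    # THAI DIGIT ONE + LAO DIGIT ONE

# A single non-decimal numeric character still has a value.
my $roman = num("\x{2165}");            # ROMAN NUMERAL SIX

printf "arabic=%s mixed=%s roman=%s\n",
    map { defined $_ ? $_ : 'undef' } $arabic, $mixed, $roman;
```

Returning undef for anything ambiguous sidesteps most of the policy questions, at the cost of not handling mixed or non-positional notations at all.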
> I would suggest
>
>                    word   digit   space
> ASCII-only         \w     \d      \s
> Includes Unicode   \Uw    \Ud     \Us
>
> Only \U is already used. And \u.
Actually then, on that note could we consider some more modifiers?
ASCII: m/\w/a m/\d/a m/\s/a
Unicode: m/\w/u m/\d/u m/\s/u
and if neither is specified keep to the existing behaviour..
25% is still available (13 out of 52 upper and lower case ASCII characters).
perlrebackslash.pod lists all \x letters in use. It's easy to deduce
the unused ones: \F, \i, \I, \j, \J, \m, \M, \o, \O, \q, \T, \y, \Y.
\c, \g, \k, \p, \P, \x are "partially available", that is, currently they
can only be followed by a limited set of characters, so there's some
room for expansion left.
\N is partially available in 5.10.x, but taken in blead.
Abigail
OTOH, not everything that numifies matches /\d+/.
> The alternative to
> disallowing [^0-9] from \d is expanding numification to include
> alternatives. Creating a utof function that knows the values of all
> the digitty characters from all scripts would require two steps
> besides compiling the list. The first step is a big discussion to
> decide how the edge cases, including but not limited to expressions
> from mixed scripts, expressions from non-base-ten languages (why I
> used cuneiform yesterday), expressions mixing scripts that mix
> conventions, ambiguous expressions.
>
> Should attempting to numify Ⅵ0 produce six or sixty or zero and a
> warning or throw an exception but only under a new pragma and if so
> how should that pragma be enabled, either via strict or autodie?
Roman numerals are classified as "Number Letter" in the Unicode database,
and hence don't match /\d+/, which matches *digits*.
Abigail
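A quick way to check that classification, as a sketch assuming Unicode::UCD is available. U+2165 ROMAN NUMERAL SIX is used as the sample character:

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

my $roman_six = "\x{2165}";                          # ROMAN NUMERAL SIX
my $category  = charinfo(0x2165)->{category};        # general category from the UCD
my $is_digit  = $roman_six =~ /\A\d\z/      ? 1 : 0; # does \d match it?
my $is_nl     = $roman_six =~ /\A\p{Nl}\z/  ? 1 : 0; # is it a Number Letter?

print "$category $is_digit $is_nl\n";                # prints "Nl 0 1"
```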
And then the 'short hand' for "[0-9]" becomes "(?a:\d)". ;-)
/(?u-a:\d)/: match any non-ASCII digit?
Abigail
We could extend numification to handle any or all Unicode code points
that have a numeric value. But I don't think \d should match anything
more than decimal digits. There is a CJK ideograph that means 10**12.
People are expecting \d to match a single digit.
Oh. I left out the possibility of having nonsensical mixed expressions
return NaN after they warn, in case someone trying to implement
Text::Numeric::Any reads this thread some day.
> We could extend numification to handle any or all Unicode code points that
> have a numeric value. But I don't think \d should match anything more than
> decimal digits. There is a CJK ideograph that means 10**12. People are
> expecting \d to match a single digit.
So if \d only means [0-9] plus the various other kinds of [0-9] in
different writing systems, and doesn't include, for instance, [①-⒛],
then utoi and utof pretty much write themselves, without changing the
semantics any. What's the range of characters permissible for the point
in floating point, and should \. match them too, if there are more?
> Non-unihan Unicode has three types of numbers. Decimal digit, other
> digit, and other numeric. \d only matches decimal digits, which is the
> "right" thing in my mind. "other digits" are like superscript 1. I
> think it is a reasonable argument to make that \d shouldn't match
> anything that you can't numify automatically; I don't think it is a good
> idea to have it match superscripts nor roman numerals, nor fractions.
> We could extend numification to handle any or all Unicode code points
> that have a numeric value. But I don't think \d should match anything
> more than decimal digits. There is a CJK ideograph that means 10**12.
> People are expecting \d to match a single digit.
There may be issues even with just that. What's a "digit",
really? Some code points that by name call themselves DIGITs
are \d, but some calling themselves DIGITs aren't--they're \D,
like all the counting-rod digits:
\D U+1d360 COUNTING ROD UNIT DIGIT ONE
\D U+1d369 COUNTING ROD TENS DIGIT ONE
True, those seem no great loss to ignore, but I'm not so sure
all others are. Some scripts' DIGITs are actually mixed \d
and \D, like:
\d U+00f21 TIBETAN DIGIT ONE
\D U+00f2a TIBETAN DIGIT HALF ONE
\d U+00c67 TELUGU DIGIT ONE
\D U+00c79 TELUGU FRACTION DIGIT ONE FOR ODD POWERS OF FOUR
I imagine, but do not know, that numbers in those scripts might
be composed of a mixture of \d and \D DIGITs.
Ethiopic DIGITs are *all* \D, and read left to right:
\D L U+01369 ETHIOPIC DIGIT ONE
Whereas all Kharoshthi DIGITs are also \D, but read right to left:
\D R U+10a40 KHAROSHTHI DIGIT ONE
You'd think it'd be less confusing just within say, European
Numbers, but it isn't that obviously so. Our "1", the Arabic
numeral digit one, is of \p{BidiClass:EN}, a European Number
not an Arabic one:
\d EN U+00031 DIGIT ONE
While Arabic-Indic's digit one is instead \p{BidiClass:AN}, an
Arabic numeral that indeed counts as an Arabic Number:
\d AN U+00661 ARABIC-INDIC DIGIT ONE
Yet the *extended* Arabic-Indic's digit one has swapped back to
being a European Number again:
\d EN U+006f1 EXTENDED ARABIC-INDIC DIGIT ONE
If automagic atod()ish nummification [hmm: or numefication?*]
for digit-strings is the goal, I don't know how to handle digit
strings composed of digits from various scripts. Besides the \d
\D issues above, even when restricted to \d digits, directional
concerns remain.
Most go left to right:
\d L U+009e7 BENGALI DIGIT ONE
\d L U+00e51 THAI DIGIT ONE
\d L U+00f21 TIBETAN DIGIT ONE
But Nko digits, which are real \d digits, are written right to left:
\d R U+007c1 NKO DIGIT ONE
I think it's probably prudent to avoid most or all non-\d
numbers, like subscripts and superscripts, even when those
count as European Numbers:
\D EN U+000b9 SUPERSCRIPT ONE
\D EN U+02081 SUBSCRIPT ONE
\D EN U+02488 DIGIT ONE FULL STOP
\D ON U+02460 CIRCLED DIGIT ONE
\D ON U+0278a DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ONE
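Rows in this format can be regenerated rather than looked up by hand. The following is only a sketch: the helper name describe is hypothetical, and it assumes Unicode::UCD's charinfo() for the Bidi class and character name, with the \d column taken from an actual regex match on the running perl:

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Hypothetical helper: one table row per code point, in the layout
# "\d-or-\D  BidiClass  U+xxxxx  NAME".
sub describe {
    my ($cp) = @_;
    my $info  = charinfo($cp) or return;              # unassigned code point
    my $class = chr($cp) =~ /\A\d\z/ ? '\d' : '\D';   # ask the regex engine itself
    return sprintf '%s %-2s U+%05x %s', $class, $info->{bidi}, $cp, $info->{name};
}

print describe($_), "\n"
    for 0x0031, 0x0661, 0x06f1, 0x07c1, 0x1369, 0x00b9;
```

Because the first column comes from a real match, the output reflects whatever \d means on the perl that runs it.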
If somebody is asked to enter the year like for copyrights, then
2009 of course works perfectly fine, and so do some other scripts.
It seems too specific to the problem domain, not to mention just
plain dangerous, to try to cope with Roman numerals, whether
entered as MMIX or in the Unicode versions.
Other writing systems than Roman/Latin also use their letters
as numeric digits. If the current Hebrew year is 2009 + 3760
= 5769, they'd then write that--from right to left--as 769,
dropping the 5000 by convention, this way:
\D R U+005ea HEBREW LETTER TAV (number 400)
\D R U+005e9 HEBREW LETTER SHIN (number 300)
\D R U+005e1 HEBREW LETTER SAMEKH (number 60)
\D R U+005d8 HEBREW LETTER TET (number 9)
Actually, order doesn't really matter: it's not positional, just
accumulative. Since we'd boot the Romans, I guess we'd boot the
Hebrews, too; don't see any way around it, really.
Just a few conundra I've been tossing around.
--tom
--
* numinal: divine.
numinous: of or pertaining to a numen; divine, spiritual,
revealing or suggesting the presence of a god; inspiring
awe and reverence. Hence numinosity, numinousness, numinously.
numify: to apotheosize.
Lots of good points about the breakage of utf-8 vs non-utf-8. This
should be fixed, but it's not really related to concerns I have about
\d. I think you and I agree that users should not need to know whether
a string is utf-8 or non-utf-8. All Perl operations should, if at all
possible, operate the same, no matter what internal form it happens to
use to encode the string. It is a leaky abstraction for every exception
to this rule. It's bad, and it has prevented Perl from successfully
reaching the state of "embracing Unicode."
My concerns for \d are well covered by other people, even in posts today.
I have LOTS of code that uses \d written over the last 20 years, and it
is very concerning to me that this code, which uses \d as a guard against
invalid input, may be accepting Unicode characters which cause my
programs to break, cause my programs to be exploitable over the network,
or cause data corruption.
I have a lot of code that does something like:
    $number = /\A\d+\z/ ? 0+$_ :
        die "...";
It scares me that this code may now be broken, or may become broken.
Was I wrong to use \d+? What should I have used? I've been taught to
avoid [0-9] in all languages since before I learned Perl, due to silly
character sets like EBCDIC. Now it seems like my only choice is:
    $number = /\A[0123456789]+\z/ ? 0 + $_ :
        die "...";
To me, that's just insanity.
I don't think \d should match anything that would allow /\A\d+\z/ to
result in a value where ("$_" != 0+$_). Go ahead and make 0+$_ better,
or go ahead and make \d match only ASCII '0' through '9' - but anything
that causes this "identity" break is a BAD decision. Heck, even go ahead
and make a BAD decision - but the result will be that I recommend
against using Perl. I trust my programming language to do what I tell it
to. If it starts doing stupid things with each future release - I will
stop trusting it, and I will stop using it.
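To make the concern concrete, here is a small sketch. The two helper names are hypothetical, and it assumes a perl on which \d applies Unicode rules to this string. A string of two Bengali digits passes the \d guard, fails the explicit [0-9] guard, and numifies to 0:

```perl
use strict;
use warnings;

# Hypothetical helper names, for illustration only.
sub guard_backslash_d { return $_[0] =~ /\A\d+\z/    ? 1 : 0 }
sub guard_ascii       { return $_[0] =~ /\A[0-9]+\z/ ? 1 : 0 }

my $bengali_42 = "\x{09EA}\x{09E8}";   # BENGALI DIGIT FOUR, BENGALI DIGIT TWO

# Numification only understands ASCII digits, so this yields 0
# (and would warn "isn't numeric" if the warning were left on).
my $numified = do { no warnings 'numeric'; 0 + $bengali_42 };

printf "%d %d %d\n",
    guard_backslash_d($bengali_42), guard_ascii($bengali_42), $numified;
# prints "1 0 0"
```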
Cheers,
mark
--
Mark Mielke<ma...@mielke.cc>
Here is a reference about Unicode security issues:
http://unicode.org/reports/tr36/
And here is an excerpt from that:
Turning away from the focus on domain names for a moment, there is
another area where visual spoofs can be used. Many scripts have sets of
decimal digits that are different in shape from the typical European
digits {0 1 2 3 4 5 6 7 8 9}. For example, Bengali has
{০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯}, while
Oriya has {୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯}. While the sets taken as a whole are
different in shape, individual digits may have the same shapes as digits
from other scripts, even digits of different values. For example, the
string ৪୨ is visually confusable with 89 (at small sizes), but actually
has the numeric value 42. Where software interprets the numeric value of
a string of digits without detecting that the digits are from different
scripts, it is possible to generate such spoofs.
John
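One possible way to detect such a mixed-script digit string, sketched under the assumption that Unicode::UCD's num() is available (it was added in releases later than the ones discussed in this thread). num() is documented to return undef unless all digits come from the same script:

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

my $mixed   = "\x{09EA}\x{0B68}";   # BENGALI DIGIT FOUR, ORIYA DIGIT TWO
my $bengali = "\x{09EA}\x{09E8}";   # BENGALI DIGIT FOUR, BENGALI DIGIT TWO

# Both characters are decimal digits, so a plain \d+ check lets the
# mixed string through.
my $passes_d = $mixed =~ /\A\d+\z/ ? 1 : 0;

my $mixed_value   = num($mixed);    # undef: the digits span two scripts
my $bengali_value = num($bengali);  # 42: a single script, so it is accepted

printf "%d %s %d\n",
    $passes_d, defined $mixed_value ? $mixed_value : 'undef', $bengali_value;
```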
But Character.getNumericValue('Ⅳ') == 4 in Java, so there is a
numeric mapping for the Roman numerals.
See http://www.fileformat.info/info/unicode/char/2163/index.htm
If we decide to continue to allow \d to match non-ASCII, and I'm not
advocating that, it should only match decimal digits, regardless of the
names of the characters.
>
> \d U+00f21 TIBETAN DIGIT ONE
> \D U+00f2a TIBETAN DIGIT HALF ONE
>
Here is an example of a poor choice of name, or at least one that is
misunderstood by people. The term DIGIT in Unicode means only that it
is a single character that has a numeric meaning, much like we refer to
the ASCII F as a hexadecimal digit. So a DIGIT in Unicode doesn't have
to mean 0 through 9.
In this particular case, the HALF ONE means that this is 1 - .5 = .5,
which is the numeric value of the character. So it is a digit that
means a non-integral number. I think it should have been named ONE
MINUS HALF. There is a HALF ZERO whose value is -.5. I got curious a
while back as to why Tibetan of all languages in the world would have a
single character encoding the concept of -.5, so I looked it up on the
internet. IIRC, the claim was that all these half digits are based on a
single Tibetan postage stamp of one of the values, and that the others
were inferred artificially based on Tibetan grammatical rules, and there
is no concrete evidence that they really ever existed except for one of
them on that one stamp. There's a picture of it on the internet.
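The numeric values mentioned here can be read straight from the Unicode database. A minimal sketch, assuming Unicode::UCD's charinfo(), which reports fractional values as strings:

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

my $half_one  = charinfo(0x0F2A)->{numeric};   # TIBETAN DIGIT HALF ONE
my $half_zero = charinfo(0x0F33)->{numeric};   # TIBETAN DIGIT HALF ZERO

# Neither is a decimal digit, so \d does not match them.
my $matches_d = "\x{0F2A}\x{0F33}" =~ /\d/ ? 1 : 0;

print "$half_one $half_zero $matches_d\n";     # prints "1/2 -1/2 0"
```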
> \d U+00c67 TELUGU DIGIT ONE
> \D U+00c79 TELUGU FRACTION DIGIT ONE FOR ODD POWERS OF FOUR
>
> I imagine, but do not know, that numbers in those scripts might
> be composed of a mixture of \d and \D DIGITs.
U+0C79 also has numeric value one, but it appears to be more like a
remainder after taking a number mod 4, so I doubt that it stands on its own.
It seems we may be presuming a base 10 system inappropriately at times.
>
> Ethiopic DIGITs are *all* \D, and read left to right:
>
> \D L U+01369 ETHIOPIC DIGIT ONE
>
> Whereas all Kharoshthi DIGITs are also \D, but read right to left:
>
> \D R U+10a40 KHAROSHTHI DIGIT ONE
>
> You'd think it'd be less confusing just within say, European
> Numbers, but it isn't that obviously so. Our "1", the Arabic
> numeral digit one, is of \p{BidiClass:EN}, a European Number
> not an Arabic one:
>
> \d EN U+00031 DIGIT ONE
>
> While Arabic-Indic's digit one is instead \p{BidiClass:AN}, an
> Arabic numeral that indeed counts as an Arabic Number:
>
> \d AN U+00661 ARABIC-INDIC DIGIT ONE
>
> Yet the *extended* Arabic-Indic's digit one has swapped back to
> being a European Number again:
>
> \d EN U+006f1 EXTENDED ARABIC-INDIC DIGIT ONE
>
These classifications as European or Arabic Number are solely for
implementing the Unicode Bidirectional Algorithm, and don't mean
anything beyond that. Again, using the decimal digit type would be the
way to go, if we continue to go there.
> If automagic atod()ish nummification [hmm: or numefication?*]
> for digit-strings is the goal, I don't know how to handle digit
> strings composed of digits from various scripts. Besides the \d
> \D issues above, even when restricted to \d digits, directional
> concerns remain.
>
> Most go left to right:
>
> \d L U+009e7 BENGALI DIGIT ONE
> \d L U+00e51 THAI DIGIT ONE
> \d L U+00f21 TIBETAN DIGIT ONE
>
> But Nko digits, which are real \d digits, are written right to left:
>
> \d R U+007c1 NKO DIGIT ONE
These are some of the reasons I'm leery of allowing \d to match outside
ASCII by default. I don't get the symmetry argument of Abigail's. A
number of people have responded recently about how they're surprised at
how it really works now. Much code has been written that assumes that
\d is [0-9], and that digits are part of a base 10 number written
left-to-right. I'm not sure right now if there is a way for a program
that doesn't use Encode or the command line options or I/O layers to get
Unicode data unexpectedly. But even if they are, Unicode is a large
beast, and someone may be getting more than they bargained for.
>