> I think scan_word should be using is_utf8_idcont, rather than
> isALNUM_utf8. The attached patch makes it do just this.
The tests had not finished running when I sent that. lib/utf8.t is
failing. It turns out that things are not as simple as I thought.
toke.c has 23 instances of isIDFIRST_lazy_if, so it seems that most of
the code is expecting S_scan_word to match something like
/^(?!\p{IsDigit})[\p{ID_Continue}_]+/
whereas what it actually matches (ignoring package separators) is
/^([\p{IsWord}_]\pM?)*/
My patch prevents qq·aaa· from being valid syntax, because U+B7 is
part of \p{ID_Continue} (hence the lib/utf8.t failure). One thing my
patch didn’t address was the \pM? (is_utf8_mark) part of scan_word.
\p{ID_Continue} contains all of \pM except for the thirteen characters
in \p{Me}.
So there is a potential for breakage if we make everything match
Unicode. The isIDFIRST_utf8 macro in handy.h is already explicitly
looser than Unicode.
Fixing this bug requires an arbitrary decision from someone more
knowledgeable than I.
In Perl_yylex in toke.c:
switch (*s) {
default:
if (isIDFIRST_lazy_if(s,UTF))
goto keylookup;
isIDFIRST_lazy_if returns true for characters in ID_Continue that are
not digits. (see handy.h:
/* The ID_Start of Unicode is quite limiting: it assumes a L-class
* character (meaning that you cannot have, say, a CJK character).
* Instead, let's allow ID_Continue but not digits. */
#define isIDFIRST_utf8(p) (is_utf8_idcont(p) && !is_utf8_digit(p))
)
Then further down (in toke.c):
keylookup: {
...(8 lines snipped)...
s = scan_word(s, PL_tokenbuf, sizeof PL_tokenbuf, FALSE, &len);
S_scan_word has:
else if (UTF && UTF8_IS_START(*s) && isALNUM_utf8((U8*)s)) {
So characters in \p{OtherIDContinue}, such as U+387 and U+1369, get
treated as the first char of a keyword by isIDFIRST_lazy_if, but
scan_word rejects them and does not advance, since it doesn’t use
isIDFIRST_lazy_if except after a ‘'’. So we have an infinite number of
zero-length keywords....
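
The mismatch described above can be reproduced in a toy model. The sketch below is not toke.c code: every name in it (toy_is_idfirst, toy_is_alnum, toy_scan_word, toy_count_stalls) is hypothetical, the character tables are reduced to ASCII plus U+0387, and the loop is capped so that the demonstration terminates. It only illustrates how a first-character test that is looser than the scanner leaves the lexer at the same position forever.

```c
#include <stddef.h>

/* Hypothetical stand-in for isIDFIRST_lazy_if: "ID_Continue but not
 * digits". U+0387 is accepted here, as reported in the message above. */
static int toy_is_idfirst(unsigned cp)
{
    if (cp >= '0' && cp <= '9') return 0;
    return cp == '_' || (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')
        || cp == 0x0387;
}

/* Hypothetical stand-in for the isALNUM_utf8 test inside scan_word:
 * U+0387 is rejected here. */
static int toy_is_alnum(unsigned cp)
{
    return cp == '_' || (cp >= '0' && cp <= '9')
        || (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z');
}

/* Toy scan_word: returns how many code points of s[0..n) it consumed. */
static size_t toy_scan_word(const unsigned *s, size_t n)
{
    size_t i = 0;
    while (i < n && toy_is_alnum(s[i])) i++;
    return i;
}

/* Toy lexer loop: counts zero-length "keywords" and gives up after `cap`
 * of them (the real lexer has no such cap, hence the hang). */
static size_t toy_count_stalls(const unsigned *s, size_t n, size_t cap)
{
    size_t pos = 0, stalls = 0;
    while (pos < n && stalls < cap) {
        if (toy_is_idfirst(s[pos])) {
            size_t len = toy_scan_word(s + pos, n - pos);
            if (len == 0) stalls++;   /* no progress was made */
            pos += len;
        } else {
            pos++;                    /* some other token: skip one char */
        }
    }
    return stalls;
}
```

With the input { 0x0387, 'a' } the stall counter reaches whatever cap is passed in, while plain ASCII input such as "foo" never stalls.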
Thanks for finding this. I've wondered about the comment in handy.h
that you quoted that documents that we decided to use a Perl home-grown
version of this rather than the official Unicode one. To repeat, it is
/* The ID_Start of Unicode is quite limiting: it assumes a L-class
* character (meaning that you cannot have, say, a CJK character).
* Instead, let's allow ID_Continue but not digits. */
Jarkko wrote that comment in 2002. Since then (actually quite a long
time ago), Unicode has fixed this problem, and the official ID_Start
does include Han characters and Korean syllables.
Jarkko wrote me last year that "Unicode knows best". In other words,
they will, in general, do a better job than people at Perl could
possibly do at figuring out what's best. They haven't always, but it's
getting better as the Standard has evolved, and is stabilizing over
time. They've put more and more things into place to minimize errors,
but that's not to say those have gone to zero.
In 5.12, I took Jarkko's advice, and changed our definitions of \p
properties to be identical to Unicode's. And people agreed with that
decision, so that's what got shipped.
I had been planning to look at this area too, and your posts spurred me
to do it. What I think is that we should move to Unicode's definitions,
even if it means breaking some existing code. Going forward, then, we
won't have to worry about it, as those definitions get added to (and
perhaps modified); we just follow the Standard.
The middle dot that caused your test to fail is a character that Unicode
has had some trouble deciding how to handle. I haven't checked whether
it has changed with regard to this property, but it has in others. That
has since settled down, and it has remained unchanged in recent releases.
Actually, I think we should move not to ID_Start but to Unicode's
revised property, XID_Start, which they recommend over the earlier one.
It is nearly identical, but better handles a few weirdly behaving
characters, mostly in Thai, Lao, Greek, and Arabic. 5.12's regex \X
construct uses a similar Unicode definition that takes these into
account, and it automatically fixes the issue with marks that you
pointed out.
Unicode is keeping ID_Start around for backwards compatibility. I don't
know if they intend to do so indefinitely or not. My guess is that it
will be there for quite some time to come.
To summarize, I propose that we use Unicode's XID_Start and XID_Continue
properties in 5.14, even though that breaks one of our tests, and
possibly existing code.
Father Chrysostomos wrote:
>
> On Apr 27, 2010, at 6:56 PM, karl williamson wrote:
>
>> To summarize, I propose that we use Unicode's XID_Start and
>> XID_Continue properties in 5.14, even though that breaks one of our
>> tests, and possibly existing code.
>
> Would we change the meanings of is_utf8_idcont and is_utf8_idfirst, or
> introduce new functions?
My first take is that I think we would just change the meanings. The
differences are quite minimal. ID_Start contains 23 more characters
than XID_Start:
037A GREEK YPOGEGRAMMENI
0E33 THAI CHARACTER SARA AM
0EB3 LAO VOWEL SIGN AM
309B KATAKANA-HIRAGANA VOICED SOUND MARK
309C KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
FC5E ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
FC5F ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
FC60 ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
FC61 ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
FC62 ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
FC63 ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
FDFB ARABIC LIGATURE JALLAJALALOUHOU
FE70 ARABIC FATHATAN ISOLATED FORM
FE72 ARABIC DAMMATAN ISOLATED FORM
FE74 ARABIC KASRATAN ISOLATED FORM
FE76 ARABIC FATHA ISOLATED FORM
FE78 ARABIC DAMMA ISOLATED FORM
FE7A ARABIC KASRA ISOLATED FORM
FE7C ARABIC SHADDA ISOLATED FORM
FE7E ARABIC SUKUN ISOLATED FORM
FF9E HALFWIDTH KATAKANA VOICED SOUND MARK
FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
And ID_Continue contains 19 more characters than XID_Continue:
037A GREEK YPOGEGRAMMENI
309B KATAKANA-HIRAGANA VOICED SOUND MARK
309C KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
FC5E ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
FC5F ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
FC60 ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
FC61 ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
FC62 ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
FC63 ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
FDFB ARABIC LIGATURE JALLAJALALOUHOU
FE70 ARABIC FATHATAN ISOLATED FORM
FE72 ARABIC DAMMATAN ISOLATED FORM
FE74 ARABIC KASRATAN ISOLATED FORM
FE76 ARABIC FATHA ISOLATED FORM
FE78 ARABIC DAMMA ISOLATED FORM
FE7A ARABIC KASRA ISOLATED FORM
FE7C ARABIC SHADDA ISOLATED FORM
FE7E ARABIC SUKUN ISOLATED FORM
So the differences are minimal; we would be recognizing 23 or 19 fewer
characters by going with the X versions. You can tell from some of the
names why it was wrong to put them in the original versions.
But I need to study things further before I come up with a recommendation.
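
For anyone who wants to check a code point against these two lists, here is a minimal sketch in C. The tables simply transcribe the lists above (so they reflect the Unicode version in use at the time, not necessarily later ones), and the names cp_range, id_start_only, id_continue_only, count_code_points and in_ranges are made up for this illustration; none of them exist in the Perl source.

```c
#include <stddef.h>

struct cp_range { unsigned lo, hi; };   /* inclusive code point range */

/* In ID_Start but not XID_Start, per the list above. */
static const struct cp_range id_start_only[] = {
    {0x037A, 0x037A}, {0x0E33, 0x0E33}, {0x0EB3, 0x0EB3},
    {0x309B, 0x309C}, {0xFC5E, 0xFC63}, {0xFDFA, 0xFDFB},
    {0xFE70, 0xFE70}, {0xFE72, 0xFE72}, {0xFE74, 0xFE74},
    {0xFE76, 0xFE76}, {0xFE78, 0xFE78}, {0xFE7A, 0xFE7A},
    {0xFE7C, 0xFE7C}, {0xFE7E, 0xFE7E}, {0xFF9E, 0xFF9F},
};

/* In ID_Continue but not XID_Continue, per the list above. */
static const struct cp_range id_continue_only[] = {
    {0x037A, 0x037A}, {0x309B, 0x309C}, {0xFC5E, 0xFC63},
    {0xFDFA, 0xFDFB}, {0xFE70, 0xFE70}, {0xFE72, 0xFE72},
    {0xFE74, 0xFE74}, {0xFE76, 0xFE76}, {0xFE78, 0xFE78},
    {0xFE7A, 0xFE7A}, {0xFE7C, 0xFE7C}, {0xFE7E, 0xFE7E},
};

#define N_ID_START_ONLY    (sizeof id_start_only / sizeof id_start_only[0])
#define N_ID_CONTINUE_ONLY (sizeof id_continue_only / sizeof id_continue_only[0])

/* Total number of code points covered by a range table. */
static size_t count_code_points(const struct cp_range *t, size_t n)
{
    size_t total = 0, i;
    for (i = 0; i < n; i++)
        total += (size_t)(t[i].hi - t[i].lo) + 1;
    return total;
}

/* Returns 1 if cp falls inside any range of the table, else 0. */
static int in_ranges(const struct cp_range *t, size_t n, unsigned cp)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (cp >= t[i].lo && cp <= t[i].hi) return 1;
    return 0;
}
```

The two tables cover 23 and 19 code points respectively, matching the counts given above. The four characters found only in the first table are U+0E33, U+0EB3, U+FF9E and U+FF9F.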
>
> In anticipation of this change, I’ve attached a patch that corrects the
> test in utf8.t to use ¡ instead of ·. I’ve also moved the test outside
> of the eval, so it will still run (and fail) if the compilation fails,
> instead of causing an invalid test count.
>
Thanks. Have you considered adding a timeout? test.pl has one that will
kill the test script if Perl hangs.
That sounds good to me.
Since this causes Perl to hang, I think it should be addressed somehow in 5.12.1. It may be that the thing to do is just document it.
It's been around since 2007. I'm still looking at how things are done currently, and a number of things appear wrong to me, but that's an initial take, subject to further consideration.
I’ve applied this test patch (but not the fix for the original bug
reported) as 7b301413cce02b9a948a0e223b4f6a6c0112f1c1.
I had thought some about this and concluded that the only way to
guarantee things without breaking any backward compatibility is to make
the continuation consistent with our own screwed up definition of first.
But I haven't gone further on it. We can also change the tokenizer to
not hang but die instead if caught in a loop, but that's not the root cause.
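
The "die instead of hang" idea amounts to a progress check in the token loop. A minimal sketch of that check follows, assuming a scanner callback that reports how much input it consumed. The names guard_scan_fn, guard_lex_all, guard_scan_any and guard_scan_letters are hypothetical and do not correspond to anything in toke.c. As noted, this treats the symptom and not the root cause.

```c
#include <stddef.h>

/* Hypothetical scanner type: returns how many bytes of s[0..n) it consumed. */
typedef size_t (*guard_scan_fn)(const char *s, size_t n);

/* Runs the scanner until the input is used up. Returns 0 on success, or -1
 * as soon as the scanner consumes nothing, storing the offending offset in
 * *stuck_at when that pointer is non-NULL. */
static int guard_lex_all(const char *s, size_t n, guard_scan_fn scan,
                         size_t *stuck_at)
{
    size_t pos = 0;
    while (pos < n) {
        size_t len = scan(s + pos, n - pos);
        if (len == 0) {               /* zero-length token: bail out */
            if (stuck_at) *stuck_at = pos;
            return -1;
        }
        pos += len;
    }
    return 0;
}

/* Toy scanner that always consumes one byte. */
static size_t guard_scan_any(const char *s, size_t n)
{
    (void)s;
    return n ? 1 : 0;
}

/* Toy scanner that consumes a run of lowercase ASCII letters only. */
static size_t guard_scan_letters(const char *s, size_t n)
{
    size_t i = 0;
    while (i < n && s[i] >= 'a' && s[i] <= 'z') i++;
    return i;
}
```

A caller would turn the -1 return into a fatal error that names the offset, so a stuck scan is reported and never spins.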
I have thought about it a bit now. The problem with allowing Unicode
alphanumerics in identifiers is that Perl’s syntax changes subtly over
time. (So I do not think the ‘Unicode knows best’ rule, though
appropriate for \p, can apply to parsing.) This would not be a problem
if we only ever added to the allowed characters, but non-identifier
characters are allowed as delimiters. This means that we subtract from
the latter whenever we add to the former. We obviously cannot undo this
now, but we can avoid changing it more than necessary. To that end, I
have fixed the looping with commit d7425188, by excluding from
isIDFIRST_utf8 the characters that were looping. No other characters
have been affected, so this is 100% backward-compatible (until the next
Unicode upgrade :-( ).
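
One way to picture "excluding the characters that were looping" is to require that anything accepted as a first character is also accepted by the scanner. The sketch below is a toy model under that reading, not the contents of the commit: all sketch_* names are hypothetical, and the tables are reduced to ASCII plus the two Other_ID_Continue code points mentioned earlier in the thread (U+0387 and U+1369).

```c
#include <stddef.h>

static int sketch_is_digit(unsigned cp) { return cp >= '0' && cp <= '9'; }

/* Toy version of what the scanner accepts (ASCII word characters only). */
static int sketch_is_alnum(unsigned cp)
{
    return cp == '_' || sketch_is_digit(cp)
        || (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z');
}

/* Toy ID_Continue: the word characters plus U+0387 and U+1369. */
static int sketch_is_idcont(unsigned cp)
{
    return sketch_is_alnum(cp) || cp == 0x0387 || cp == 0x1369;
}

/* Old rule: ID_Continue but not digits. */
static int sketch_idfirst_old(unsigned cp)
{
    return sketch_is_idcont(cp) && !sketch_is_digit(cp);
}

/* Adjusted rule: additionally require that the scanner accepts it. */
static int sketch_idfirst_new(unsigned cp)
{
    return sketch_idfirst_old(cp) && sketch_is_alnum(cp);
}

/* Counts code points in [lo, hi] that may start a word but that the
 * scanner would refuse to consume. hi must be below UINT_MAX. */
static size_t sketch_count_stuck(int (*idfirst)(unsigned),
                                 unsigned lo, unsigned hi)
{
    size_t stuck = 0;
    unsigned cp;
    for (cp = lo; cp <= hi; cp++)
        if (idfirst(cp) && !sketch_is_alnum(cp)) stuck++;
    return stuck;
}
```

In this model the old rule leaves two stuck code points in the range 0 to 0x2000 and the adjusted rule leaves none. Every character that both the old first-character test and the scanner accepted is still accepted, which is the backward-compatibility property claimed above.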
To make sure we don’t change q·foo· unknowingly, I’ve changed the test
back with commit ab10b0e544.
(As a side note, ECMAScript allows Unicode characters in identifiers,
but that set only ever grows. Other non-ASCII characters are forbidden
outside of literals and comments, so the shrinking set problem does not
occur. CSS allows any non-ASCII characters in identifiers.)
As someone who has within the last 24 hours spent a bit of time
in the Perl debugger with Unicode alphanumeric identifiers, I
can tell you it does *not* work very well.
This command:
% echo "ácütê" | perl -CS -d -S leo
Yields this kind of garbage:
Loading DB routines from perl5db.pl version 1.33
Editor support available.
Enter h or `h h' for help, or `man perldebug' for more help.
main::(/Users/tomchristiansen/scripts/leo:38):
38: main();
DB<1> b flip_diacriticals
DB<2> c
main::flip_diacriticals(/Users/tomchristiansen/scripts/leo:135):
135: binmode(DATA, ":utf8");
DB<2> T
Wide character in print at /usr/local/lib/perl5/5.12.2/perl5db.pl line 6789, <> line 1.
Wide character in print at /usr/local/lib/perl5/5.12.2/perl5db.pl line 5724, <> line 1.
at /usr/local/lib/perl5/5.12.2/perl5db.pl line 5724
DB::print_trace('GLOB(0x253060)', 1) called at /usr/local/lib/perl5/5.12.2/perl5db.pl line 2842
DB::DB called at /Users/tomchristiansen/scripts/leo line 135
main::flip_diacriticals('êtücá') called at /Users/tomchristiansen/scripts/leo line 121
main::reverse_mark_flip('ácütê') called at /Users/tomchristiansen/scripts/leo line 57
main::uÊ opÉ™pᴉƨdn('ácütê') called at /Users/tomchristiansen/scripts/leo line 47
main::main() called at /Users/tomchristiansen/scripts/leo line 38
$ = main::flip_diacriticals('êtücá') called from file `/Users/tomchristiansen/scripts/leo' line 121
$ = main::reverse_mark_flip('M-acM-|tM-j') called from file `/Users/tomchristiansen/scripts/leo' line 57
$ = main::uʍopəpᴉƨdn('M-acM-|tM-j') called from file `/Users/tomchristiansen/scripts/leo' line 47
. = main::main() called from file `/Users/tomchristiansen/scripts/leo' line 38
I've used -CS on the command line, and I've even used it
before -d. I have PERL_UNICODE=SA in my shell. My program
starts like this:
use 5.010_000;
use utf8;
use strict;
use autodie;
use warnings qw[ FATAL all ];
use open qw[ :std :utf8 ];
I can't think of anything else to do.
Oh wait. Yes, I can!
% echo "ácütê" | perl -CS -d -S leo
Loading DB routines from perl5db.pl version 1.33
Editor support available.
Enter h or `h h' for help, or `man perldebug' for more help.
main::(/Users/tomchristiansen/scripts/leo:38):
38: main();
DB<1> binmode(DB::OUT, ":utf8") || die
DB<2> b flip_diacriticals
DB<3> c
main::flip_diacriticals(/Users/tomchristiansen/scripts/leo:135):
135: binmode(DATA, ":utf8");
DB<3> T
$ = main::flip_diacriticals('êtücá') called from file `/Users/tomchristiansen/scripts/leo' line 121
$ = main::reverse_mark_flip('M-acM-|tM-j') called from file `/Users/tomchristiansen/scripts/leo' line 57
$ = main::uÊ opÉ™pᴉƨdn('M-acM-|tM-j') called from file `/Users/tomchristiansen/scripts/leo' line 47
. = main::main() called from file `/Users/tomchristiansen/scripts/leo' line 38
See, it's still garbage!
What am I supposed to do?
And watch this:
DB<3> b main::uÊ opÉ™pᴉƨdn
Subroutine main::u not found.
That was entered as
b main::<TAB>
and it completed to that CRAP. Heck, even when I type
b main::uʍopəpᴉƨdn
it ignores me and displays
b main::uÊ opÉ™pᴉƨdn
and then again bitches about
Subroutine main::u not found.
To add injury to insult, that's illegal UTF-8 up there in its output!
What am I supposed to do about *that*?
It's just totally bollocksed, is what it is. :(
--tom
Let me be more clear. Perl works perfectly well. So does
my program. It's just the Perl debugger that's broken.
--tom
Could you send your previous message to per...@perl.org without the
[perl #...] in the subject?