Apparent bug in Perl 5.10 regexes w. UTF-8 expression

Ben Bullock

Jul 13, 2008, 10:14:06 AM
I've found a place where Perl seems to behave differently depending on
whether something is marked as UTF-8 or not, regardless of the fact that
it is just ASCII.

In the following code snippet,

#!/usr/local/bin/perl -lw
use strict;
use Encode 'decode';
use Lingua::JA::FindDates 'subsjdate';
binmode STDERR,"utf8";
binmode STDOUT,"utf8";
print STDERR "first try\n";
my $test = "ABCDEFG";
print subsjdate($test);
print STDERR "now try again\n";
$test = decode ('utf8', $test);
print subsjdate($test);

the output is like this:

ben ~ 541 $ ./test2.pl
first try

Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
ABCDEFG
now try again

ABCDEFG
ben ~ 542 $

But if I add

use utf8;

and call the routine with a non-ASCII string such as 平成, I don't get the
warnings.

What's more, after about one hour of exhaustive checking, I'm fairly sure
that there is no uninitialized value in the pattern match in question. In
fact I can make the warning go away by removing a variable which is
initialized, called $kanjidigits, from the pattern match, but that seems
even stranger.

I think the above-described behaviour, regardless of any errors in the
module, indicates an error in Perl. Also, I think there is nothing wrong
with the module. Does anybody have any other opinions?
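
For reference, the flag difference at issue can be seen without the module. The following is a minimal sketch, not part of the original post. It assumes the core function utf8::upgrade as a stand-in for Encode's decode (to avoid depending on how decode treats pure-ASCII input), and the variable names are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two copies of the same ASCII text.
my $bytes = "ABCDEFG";
my $chars = $bytes;

# utf8::upgrade switches the internal representation to UTF-8 and sets
# the UTF-8 flag; the characters in the string are unchanged.
utf8::upgrade($chars);

my $flag_bytes = utf8::is_utf8($bytes) ? 1 : 0;
my $flag_chars = utf8::is_utf8($chars) ? 1 : 0;
my $equal      = $bytes eq $chars ? 1 : 0;

print "flag on plain copy:    $flag_bytes\n";
print "flag on upgraded copy: $flag_chars\n";
print "strings compare equal: $equal\n";
```

Since the two strings compare equal, code that handles them should in principle behave the same for both, which is why a warning that shows up for only one of them looks suspicious.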

Peter J. Holzer

Jul 13, 2008, 12:03:21 PM
On 2008-07-13 14:14, Ben Bullock <benkasmi...@gmail.com> wrote:
> I've found a place where Perl seems to behave differently depending on
> whether something is marked as UTF-8 or not, regardless of the fact that
> it is just ASCII.
>
> In the following code snippet,
>
> #!/usr/local/bin/perl -lw
> use strict;
> use Encode 'decode';
> use Lingua::JA::FindDates 'subsjdate';
> binmode STDERR,"utf8";
> binmode STDOUT,"utf8";
> print STDERR "first try\n";
> my $test = "ABCDEFG";
> print subsjdate($test);
> print STDERR "now try again\n";
> $test = decode ('utf8', $test);
> print subsjdate($test);
>
> the output is like this:
>
> ben ~ 541 $ ./test2.pl
> first try
>
> Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
[...]

> What's more, after about one hour of exhaustive checking, I'm fairly sure
> that there is no uninitialized value in the pattern match in question.

Right. Your problem can be reproduced with this script:

#!/usr/bin/perl
use warnings;
use strict;

my $regex =
"([\x{ff10}-\x{ff19}0-9]{4}|[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]?\x{5343}[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]*)\\s*\x{5e74}";
my $test = "ABCDEFG";
if ($test =~ /($regex)/) {
    print "m:<$1>\n";
}
__END__

If the last character ("\x{5e74}") is removed from the regexp, the
warning vanishes. But if the capturing () is removed (leaving just
"\\s*\x{5e74}"), the warning vanishes, too. So it's not \x{5e74} alone
which triggers the warning, only that combined with something else.
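
To compare subjects with and without the UTF-8 flag in a single run, the warnings can be counted instead of read off the terminal. This is only a sketch: the helper count_match_warnings is a hypothetical name, the pattern is the one from the script above split into two pieces for readability, and utf8::upgrade is assumed as the way to set the flag. The counts will depend on the Perl version, so none are shown.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as in the script above, assembled from two pieces.
my $kanji = "\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}"
          . "\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}";
my $regex = "([\x{ff10}-\x{ff19}0-9]{4}|[${kanji}]?\x{5343}[${kanji}]*)"
          . "\\s*\x{5e74}";

# Hypothetical helper: run the match and return how many warnings it raised.
sub count_match_warnings {
    my ($subject) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $matched = $subject =~ /($regex)/;
    return scalar @warnings;
}

my $plain    = "ABCDEFG";
my $upgraded = $plain;
utf8::upgrade($upgraded);    # same characters, UTF-8 flag set

printf "plain:    %d warning(s)\n", count_match_warnings($plain);
printf "upgraded: %d warning(s)\n", count_match_warnings($upgraded);
```

A difference between the two counts on an affected Perl would confirm that the warning depends on the flag rather than on the string's contents.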

hp

Ben Morrow

Jul 13, 2008, 2:46:14 PM

Quoth "Peter J. Holzer" <hjp-u...@hjp.at>:

> On 2008-07-13 14:14, Ben Bullock <benkasmi...@gmail.com> wrote:
> > I've found a place where Perl seems to behave differently depending on
> > whether something is marked as UTF-8 or not, regardless of the fact that
> > it is just ASCII.
>
> Right. Your problem can be reproduced with this script:
>
> #!/usr/bin/perl
> use warnings;
> use strict;
>
> my $regex =
> "([\x{ff10}-\x{ff19}0-9]{4}|[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]?\x{5343}[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]*)\\s*\x{5e74}";

Using utf8 in regexen is not well-supported in 5.8; in particular, the
regex engine is not consistent about when to apply utf8 semantics and
when to apply byte semantics. Some of the bugs have been fixed in 5.10;
I don't know if they all have.
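
One well-known instance of this kind of inconsistency, given here only as an illustrative sketch and not necessarily related to the warning in this thread, is that \w can treat the same character differently depending on the UTF-8 flag. The example assumes a Perl run with no /u, /a or /l modifier and without the "unicode_strings" feature; the variable names are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One character, code point 0xE9 (e with acute accent in Latin-1).
my $bytes = "\xe9";
my $chars = $bytes;
utf8::upgrade($chars);    # same character, UTF-8 flag now set

# With default regex semantics, \w uses platform-native rules (only ASCII
# word characters match) for the unflagged string and Unicode rules for
# the flagged one.
my $bytes_match = $bytes =~ /\w/ ? 1 : 0;
my $chars_match = $chars =~ /\w/ ? 1 : 0;

print "strings equal:  ", ($bytes eq $chars ? 1 : 0), "\n";
print "unflagged \\w:   $bytes_match\n";
print "flagged \\w:     $chars_match\n";
```

The two strings compare equal, yet only the flagged one matches \w, which is the sort of byte-versus-character ambiguity described above.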

Ben

--
For far more marvellous is the truth than any artists of the past imagined!
Why do the poets of the present not speak of it? What men are poets who can
speak of Jupiter if he were like a man, but if he is an immense spinning
sphere of methane and ammonia must be silent? [Feynman] b...@morrow.me.uk

Ben Bullock

Jul 13, 2008, 6:18:43 PM
On Sun, 13 Jul 2008 19:46:14 +0100, Ben Morrow wrote:

> Quoth "Peter J. Holzer" <hjp-u...@hjp.at>:
>> On 2008-07-13 14:14, Ben Bullock <benkasmi...@gmail.com> wrote:
>> > I've found a place where Perl seems to behave differently depending on
>> > whether something is marked as UTF-8 or not, regardless of the fact that
>> > it is just ASCII.
>>
>> Right. Your problem can be reproduced with this script:
>>
>> #!/usr/bin/perl
>> use warnings;
>> use strict;
>>
>> my $regex =
>> "([\x{ff10}-\x{ff19}0-9]{4}|[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]?\x{5343}[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]*)\\s*\x{5e74}";
>
> Using utf8 in regexen is not well-supported in 5.8; in particular, the
> regex engine is not consistent about when to apply utf8 semantics and
> when to apply byte semantics. Some of the bugs have been fixed in 5.10;
> I don't know if they all have.

The problem I described is the behaviour of Perl 5.10:

ben ~ 501 $ perl --version

This is perl, v5.10.0 built for i686-linux

Copyright 1987-2007, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

ben ~ 502 $ ./test2.pl
first try

Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.

etc.

Should I report this as a bug?


Ben Bullock

Jul 13, 2008, 6:40:20 PM
On Sun, 13 Jul 2008 22:18:43 +0000, Ben Bullock wrote:

> Should I report this as a bug?

Never mind, I reported it anyway.
