In the following code snippet,
#!/usr/local/bin/perl -lw
use strict;
use Encode 'decode';
use Lingua::JA::FindDates 'subsjdate';
binmode STDERR,"utf8";
binmode STDOUT,"utf8";
print STDERR "first try\n";
my $test = "ABCDEFG";
print subsjdate($test);
print STDERR "now try again\n";
$test = decode ('utf8', $test);
print subsjdate($test);
the output is like this:
ben ~ 541 $ ./test2.pl
first try
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
ABCDEFG
now try again
ABCDEFG
ben ~ 542 $
But, if I
use utf8;
and call the routine with a non-ASCII string, like 平成, I don't get the
warnings.
What's more, after about an hour of exhaustive checking, I'm fairly sure
that there is no uninitialized value in the pattern match in question. In
fact, I can make the warning go away by removing a variable which is
initialized, called $kanjidigits, from the pattern match, but that seems
even weirder.
I think the above-described behaviour, regardless of any errors in the
module, indicates an error in Perl. Also, I think there is nothing wrong
with the module. Does anybody have any other opinions?
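To make the "marked as UTF-8 or not" distinction concrete, here is a minimal sketch that inspects Perl's internal UTF-8 flag on two strings with identical ASCII content. It uses utf8::upgrade rather than decode to set the flag, because that is guaranteed to turn the flag on; the variable names are only illustrative and are not part of the module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two strings with the same ASCII content.
my $plain    = "ABCDEFG";
my $upgraded = "ABCDEFG";

# Switch one of them to Perl's internal UTF-8 representation.
# The utf8:: functions are available without "use utf8".
utf8::upgrade($upgraded);

printf "plain:    flag=%d\n", utf8::is_utf8($plain)    ? 1 : 0;
printf "upgraded: flag=%d\n", utf8::is_utf8($upgraded) ? 1 : 0;

# As far as string comparison is concerned, they are the same string.
print "strings compare equal\n" if $plain eq $upgraded;
```

The two strings compare equal with eq, so any difference in how a pattern match treats them comes from the internal representation and not from the content.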
Right. Your problem can be reproduced with this script:
#!/usr/bin/perl
use warnings;
use strict;
my $regex =
"([\x{ff10}-\x{ff19}0-9]{4}|[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]?\x{5343}[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]*)\\s*\x{5e74}";
my $test = "ABCDEFG";
if ($test =~ /($regex)/) {
    print "m:<$1>\n";
}
__END__
If the last character ("\x{5e74}") is removed from the regexp, the
warning vanishes. But if the capturing () is removed (leaving just
"\\s*\x{5e74}"), the warning vanishes, too. So it's not \x{5e74} alone
which triggers the warning, only that combined with something else.
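For anyone who wants to compare Perl versions, here is a rough sketch that counts the warnings instead of just printing them, by installing a local $SIG{__WARN__} handler around the match. The helper name try_match is made up for this example, and utf8::upgrade stands in for the decode() call in the original script (assuming decode sets the UTF-8 flag there). How many warnings you see depends on your perl, so no expected counts are given.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as in the reduced test case.
my $regex =
"([\x{ff10}-\x{ff19}0-9]{4}|[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]?\x{5343}[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]*)\\s*\x{5e74}";

# Hypothetical helper: match $subject against the pattern and return
# (matched?, number of warnings emitted during the match).
sub try_match {
    my ($subject) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    my $matched = ($subject =~ /($regex)/) ? 1 : 0;
    return ($matched, scalar @warnings);
}

for my $upgrade (0, 1) {
    my $test = "ABCDEFG";
    utf8::upgrade($test) if $upgrade;    # set the UTF-8 flag on the subject
    my ($matched, $count) = try_match($test);
    printf "upgraded=%d matched=%d warnings=%d\n", $upgrade, $matched, $count;
}
```

If the warning count differs between the two runs, that points at the subject's internal representation, since the string content is the same in both.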
hp
Using utf8 in regexen is not well-supported in 5.8; in particular, the
regex engine is not consistent about when to apply utf8 semantics and
when to apply byte semantics. Some of the bugs have been fixed in 5.10;
I don't know if they all have.
Ben
> Quoth "Peter J. Holzer" <hjp-u...@hjp.at>:
>> On 2008-07-13 14:14, Ben Bullock <benkasmi...@gmail.com> wrote:
>> > I've found a place where Perl seems to behave differently depending on
>> > whether something is marked as UTF-8 or not, regardless of the fact that
>> > it is just ASCII.
>>
>> Right. Your problem can be reproduced with this script:
>>
>> #!/usr/bin/perl
>> use warnings;
>> use strict;
>>
>> my $regex =
>> "([\x{ff10}-\x{ff19}0-9]{4}|[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]?\x{5343}[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]*)\\s*\x{5e74}";
>
> Using utf8 in regexen is not well-supported in 5.8; in particular, the
> regex engine is not consistent about when to apply utf8 semantics and
> when to apply byte semantics. Some of the bugs have been fixed in 5.10;
> I don't know if they all have.
The problem I described is the behaviour of Perl 5.10:
ben ~ 501 $ perl --version
This is perl, v5.10.0 built for i686-linux
Copyright 1987-2007, Larry Wall
Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.
Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.
ben ~ 502 $ ./test2.pl
first try
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
etc.
Should I report this as a bug?
> Should I report this as a bug?
Never mind, I reported it anyway.