On Tue, Dec 09, 2003 at 06:04:33PM +0000, David Greaves wrote:
> 73 32 65 105 110 194 180 116
> 73 32 65 105 110 194 180 116
So the strings contain identical characters, but do not behave
identically. ==> Perl bug. FreezeThaw uses
substr($foo, $bar) =~ /^abc/;
and this construct may not be covered by the Perl test suite (if the RH9
people ran one when they released 5.8.0 ;-[). I am cc'ing p5p.
The next step is to run
Dump($output)
Dump($input)
on the strings, after doing `use Devel::Peek'.
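A minimal sketch of that check, assuming two strings that hold the same characters but are stored differently. The variable names and the use of utf8::upgrade() to produce the second string are illustrative assumptions, not taken from the report:

```perl
use strict;
use warnings;
use Devel::Peek;

# Illustrative data: the same characters, stored two different ways.
my $input  = "I Ain\xc2\xb4t";   # held as 8 plain bytes, no UTF8 flag
my $output = $input;
utf8::upgrade($output);          # same characters, now UTF-8 encoded internally

# The strings compare equal character by character ...
print "equal\n" if $input eq $output;

# ... but Dump() (which writes to STDERR) shows different FLAGS, PV and CUR.
Dump($input);
Dump($output);
```

Comparing the FLAGS and PV lines of the two dumps is what separates "same characters" from "same internal representation".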
Thanks for your cooperation,
Ilya
$ cat tst1
#!perl -w
use Devel::Peek;
use FreezeThaw qw(freeze thaw);
my $var = "I Ain\x{c2}\x{b4}t";
print STDERR ord, " " for split //, $var;
print STDERR "\n";
#print freeze($var);
print $var;
print STDERR Dump($var);
$ cat tst2
#!perl -w
use Devel::Peek;
use FreezeThaw qw(freeze thaw);
my $enc = <>;
chomp $enc;
print STDERR ord, " " for split //, $enc;
print STDERR "\n";
print STDERR Dump($enc);
my ($var1)=thaw($enc);
print $var1;
$ echo $LANG
en_GB.UTF-8
$ perl tst1 | perl tst2
73 32 65 105 110 194 180 116
SV = PV(0x804cbf8) at 0x8060234
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK)
PV = 0x805cda8 "I Ain\302\264t"\0
CUR = 8
LEN = 9
73 32 65 105 110 194 180 116
SV = PV(0x804cc7c) at 0x8060234
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
PV = 0x80856e8 "I Ain\303\202\302\264t"\0 [UTF8 "I Ain\x{c2}\x{b4}t"]
CUR = 10
LEN = 80
Signature not present, continuing anyway at
/usr/lib/perl5/site_perl/5.8.0/FreezeThaw.pm line 661, <> line 1.
Do not know how to thaw data with code `I' at
/usr/lib/perl5/site_perl/5.8.0/FreezeThaw.pm line 542
FreezeThaw::thawScalar(0) called at
/usr/lib/perl5/site_perl/5.8.0/FreezeThaw.pm line 679
FreezeThaw::thaw('I Ain\x{c2}\x{b4}t') called at tst2 line 9
Then, with this change in tst1:
-#print freeze($var);
-print $var;
+print freeze($var);
+#print $var;
$ LANG=C
$ perl tst1 | perl tst2
73 32 65 105 110 194 180 116
SV = PV(0x804cbf8) at 0x805f8c8
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK)
PV = 0x805c490 "I Ain\302\264t"\0
CUR = 8
LEN = 9
70 114 84 59 64 49 124 36 56 124 73 32 65 105 110 194 180 116
SV = PV(0x804cc7c) at 0x805f8c8
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK)
PV = 0x8066ac8 "FrT;@1|$8|I Ain\302\264t"\0
CUR = 18
LEN = 80
I Ain´t
David
To resolve your immediate problem, set LC_ALL=C in your environment.
5.8.0 used to treat all file handles as UTF-8 if the locales were
xx_XX.UTF-8 (and consequently badly scramble binary data).
That has been corrected in perl >= 5.8.1.
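A hedged illustration of what such a layer does to 8-bit data. In 5.8.0 the layer was applied implicitly from the locale; the sketch below applies it explicitly to in-memory file handles so that it runs on any current perl. The variable names are illustrative:

```perl
use strict;
use warnings;

my $var = "I Ain\xc2\xb4t";      # 8 characters, all below 0x100

# Write the same string through a byte-oriented and a UTF-8 layer.
open my $raw, '>:raw', \my $as_bytes or die "open: $!";
print {$raw} $var;
close $raw;

open my $utf, '>:utf8', \my $as_utf8 or die "open: $!";
print {$utf} $var;
close $utf;

# The :utf8 layer encodes each of 0xC2 and 0xB4 as two bytes.
printf "%d bytes via :raw, %d bytes via :utf8\n",
    length $as_bytes, length $as_utf8;   # 8 bytes via :raw, 10 bytes via :utf8
```

The 10-byte result, "I Ain\303\202\302\264t", matches the PV shown in the second dump of the en_GB.UTF-8 run above.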
But your example seems to be shaking out a real bug in the Perl core -
involving UTF-8 strings and substr().
More on that hopefully soon.
Regards,
Adi
Yes, this is why I'm persisting in investigating this. I can
reproduce it with 5.8.2, so it is not a fault of RH.
perl -MDevel::Peek -MFreezeThaw=freeze,thaw -wle "$a=freeze qq(bc\xa8d); print $a; $a .= chr 567; chop $a; Dump $a; print $a; print thaw $a"
Is somebody going to look at this? The CORE example is
perl -MDevel::Peek -wle "$a=qq(bzc\xa8d); print $a; $a .= chr 567; chop $a; Dump $a; print substr $a, 2, 1; substr($a, 2) =~ /^/; Dump $a; print substr $a, 2, 1"
In short: doing substr() =~ /^/ on a utf8 string breaks subsequent
substr() calls (they get the wrong length).
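For readers unfamiliar with the `.= chr 567; chop` idiom in these one-liners: it is a way of forcing the string into the internal UTF-8 representation without changing its characters. A small sketch using the same sample string (the variable name is illustrative):

```perl
use strict;
use warnings;
use bytes ();                    # load bytes::length without enabling the pragma

my $s = "bzc\xa8d";
print utf8::is_utf8($s) ? "flagged\n" : "not flagged\n";   # not flagged

$s .= chr 567;                   # a character above 0xFF forces an upgrade to UTF-8
chop $s;                         # removes that character; the UTF8 flag stays on

print utf8::is_utf8($s) ? "flagged\n" : "not flagged\n";   # flagged
print length($s), " characters, ", bytes::length($s), " bytes\n";   # 5 characters, 6 bytes
```

The 5-versus-6 difference is why substr() on such a string needs a character-to-byte offset conversion at all.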
Hope this helps,
Ilya
For a RedHat value of "5.8.0" (to be fair, a Jarkko approved RedHat ...)
On Wed, Dec 10, 2003 at 12:34:39PM -0800, Ilya Zakharevich wrote:
> On Wed, Dec 10, 2003 at 01:21:11PM +0200, Enache Adrian wrote:
> > But your example seems to be shaking out a real bug in the Perl core -
> > involving UTF-8 strings and substr().
> > More on that hopefully soon.
>
> Yes, this is why I'm persisting in investigating this. I can
> reproduce it with 5.8.2, so it is not a fault of RH.
>
> perl -MDevel::Peek -MFreezeThaw=freeze,thaw -wle "$a=freeze qq(bc\xa8d); print $a; $a .= chr 567; chop $a; Dump $a; print $a; print thaw $a"
>
> Is somebody going to look at this? The CORE example is
I hope that Enache is, given what he wrote.
> perl -MDevel::Peek -wle "$a=qq(bzc\xa8d); print $a; $a .= chr 567; chop $a; Dump $a; print substr $a, 2, 1; substr($a, 2) =~ /^/; Dump $a; print substr $a, 2, 1"
>
> In short: doing substr() =~ /^/ on a utf8 string breaks subsequent
> substr() calls (they get the wrong length).
>
> Hope this helps,
Yes, very much. Thanks for reducing it down to a pure core bug.
It looks like it's a bug introduced as part of Jarkko's post-5.8.0
UTF8 caching code. With vanilla 5.8.0:
$ perl5.8.0 -MDevel::Peek -wle '$a=qq(bzc\xa8d); print $a; $a .= chr 567; chop $a;Dump $a; print substr $a, 2, 1; substr($a, 2) =~ /^/; Dump $a; print substr $a,2, 1'
bzc¨d
SV = PV(0x8140ff8) at 0x814a6fc
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x814e038 "bzc\302\250d"\0 [UTF8 "bzc\x{a8}d"]
CUR = 6
LEN = 9
c
SV = PV(0x8140ff8) at 0x814a6fc
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x814e038 "bzc\302\250d"\0 [UTF8 "bzc\x{a8}d"]
CUR = 6
LEN = 9
c
With 5.8.1 (and 5.8.2):
bzc¨d
SV = PV(0x813ed98) at 0x8148b10
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x8153ff0 "bzc\302\250d"\0 [UTF8 "bzc\x{a8}d"]
CUR = 6
LEN = 9
c
SV = PVMG(0x8166d50) at 0x8148b10
REFCNT = 1
FLAGS = (SMG,POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x8153ff0 "bzc\302\250d"\0 [UTF8 "bzc\x{a8}d"]
CUR = 6
LEN = 9
MAGIC = 0x8160060
MG_VIRTUAL = &PL_vtbl_utf8
MG_TYPE = PERL_MAGIC_utf8(w)
MG_LEN = 5
MG_PTR = 0x8160de8
0: 2 -> 2
1: 5 -> 4
c¨d
Andreas - can your binsearch easily find the patch where this changes from
ok to not ok?
perl -wle '$a=qq(bzc\xa8d) . chr 567; $b = substr ($a, 2, 1); substr($a, 2) =~ /^/; print $b eq substr ($a,2, 1) ? "ok" : "not ok"'
All I know is that it's somewhere between 5.8.0 and 5.8.1
Nicholas Clark
It will be fixed tomorrow.
Regards,
Adi
> [...binsearch...]
----Program----
$a=qq(bzc\xa8d);
print $a;
$a .= chr 567;
chop $a;
# Dump $a;
print substr $a, 2, 1;
substr($a, 2) =~ /^/;
# Dump $a;
print substr $a,2, 1;
----Output of .../pjJua9F/perl-5.8.0@18529/bin/perl----
bzc¨dcc
----EOF ($?='0')----
----Output of .../pF0YsxJ/perl-5.8.0@18530/bin/perl----
bzc¨dcc¨d
----EOF ($?='0')----
--
andreas
A simple example - it should print a single 'x':
$ perl -CO -e '$y = "x"x 25 . chr 345; substr $y, 2; print substr $y, 8, 1'
xxxxxxxxx
The second substr() was using the cached utf-8/byte length from
the first (see Perl_sv_pos_u2b()).
The following patch (applied to blead as #21875) should fix it for now.
Regards,
Adi
--- /arc/bleadperl/sv.c Mon Dec 8 06:51:20 2003
+++ ./sv.c Thu Dec 11 18:00:47 2003
@@ -5735,10 +5735,8 @@ S_utf8_mg_pos_init(pTHX_ SV *sv, MAGIC *
bool found = FALSE;
if (SvMAGICAL(sv) && !SvREADONLY(sv)) {
- if (!*mgp) {
- sv_magic(sv, 0, PERL_MAGIC_utf8, 0, 0);
- *mgp = mg_find(sv, PERL_MAGIC_utf8);
- }
+ if (!*mgp)
+ *mgp = sv_magicext(sv, 0, PERL_MAGIC_utf8, &PL_vtbl_utf8, 0, 0);
assert(*mgp);
if ((*mgp)->mg_ptr)
@@ -5831,6 +5829,12 @@ S_utf8_mg_pos(pTHX_ SV *sv, MAGIC **mgp,
/* Update the cache. */
(*cachep)[i] = (STRLEN)uoff;
(*cachep)[i+1] = p - start;
+
+ /* Drop the stale "length" cache */
+ if (i == 0) {
+ (*cachep)[2] = 0;
+ (*cachep)[3] = 0;
+ }
found = TRUE;
}
--- /arc/bleadperl/t/op/substr.t Mon Aug 4 02:43:17 2003
+++ ./t/op/substr.t Thu Dec 11 17:40:35 2003
@@ -1,6 +1,6 @@
#!./perl
-print "1..176\n";
+print "1..177\n";
#P = start of string Q = start of substr R = end of substr S = end of string
@@ -601,4 +601,11 @@ ok 174, $x eq "\x{100}\x{200}\xFFb";
}
my $x = my $y = 'AB'; ss $x; ss $y;
ok 176, $x eq $y;
+}
+
+# [perl #24605]
+{
+ my $x = "0123456789\x{500}";
+ my $y = substr $x, 4;
+ ok 177, substr($x, 7, 1) eq "7";
}
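The stale-length problem and the idea behind the sv.c hunk can be illustrated with a toy model. This is a deliberately simplified sketch, not the code in sv.c: the helper names new_cache, walk and u2b are hypothetical, the slot layout only loosely follows the (*cachep)[0..3] indices in the patch, and the real lookup logic is more involved.

```perl
use strict;
use warnings;

# Toy cache over a UTF-8 encoded byte string:
#   slots 0/1: start offset in characters -> start offset in bytes
#   slots 2/3: end position in characters -> length in bytes from that start
sub new_cache { return [0, 0, 0, 0] }

# Walk $nchars characters forward from byte position $bpos.
sub walk {
    my ($bytes, $bpos, $nchars) = @_;
    while ($nchars-- > 0) {
        $bpos++;                                   # lead byte
        $bpos++ while $bpos < length($bytes)       # continuation bytes
            && (ord(substr($bytes, $bpos, 1)) & 0xC0) == 0x80;
    }
    return $bpos;
}

# Convert (char offset, char length) to (byte offset, byte length).
# With $drop_stale false, a cached length survives a change of start.
sub u2b {
    my ($bytes, $cache, $uoff, $ulen, $drop_stale) = @_;
    my $boff;
    if ($cache->[0] == $uoff) {
        $boff = $cache->[1];                       # start found in cache
    } else {
        $boff = walk($bytes, 0, $uoff);
        @$cache[0, 1] = ($uoff, $boff);
        @$cache[2, 3] = (0, 0) if $drop_stale;     # the idea behind the patch
    }
    my $uend = $uoff + $ulen;
    my $blen;
    if ($cache->[2] == $uend && $cache->[3]) {
        $blen = $cache->[3];                       # trusted even if the start moved
    } else {
        $blen = walk($bytes, $boff, $ulen) - $boff;
        @$cache[2, 3] = ($uend, $blen);
    }
    return ($boff, $blen);
}

my $bytes = "\xc3\xa9abc";       # U+00E9 followed by "abc", as UTF-8 bytes

my $buggy = new_cache();
u2b($bytes, $buggy, 0, 3, 0);                           # like substr($x, 0, 3)
my ($off, $len) = u2b($bytes, $buggy, 1, 2, 0);         # like substr($x, 1, 2)
print "stale cache:   ", substr($bytes, $off, $len), "\n";    # abc

my $fixed = new_cache();
u2b($bytes, $fixed, 0, 3, 1);
my ($foff, $flen) = u2b($bytes, $fixed, 1, 2, 1);
print "cache dropped: ", substr($bytes, $foff, $flen), "\n";  # ab
```

In the model, the second request asks for two characters but inherits a byte length computed for a different start, which is the same shape of failure as the 'xxxxxxxxx' output above.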
Thanks. I'm glad I didn't end up being the one hunting for this.
"for now" - what are you thinking comes later?
> @@ -5831,6 +5829,12 @@ S_utf8_mg_pos(pTHX_ SV *sv, MAGIC **mgp,
> /* Update the cache. */
> (*cachep)[i] = (STRLEN)uoff;
> (*cachep)[i+1] = p - start;
> +
> + /* Drop the stale "length" cache */
> + if (i == 0) {
> + (*cachep)[2] = 0;
> + (*cachep)[3] = 0;
> + }
Shouldn't there be a whole class of bugs fixed by this?
Basically anything calling S_utf8_mg_pos with i = 2 after a previous substr.
Looking at S_utf8_mg_pos, I can't see that the line
STRLEN ulen = sv_len_utf8(sv);
is optimally efficient: it does a linear walk through the
byte string to find the end, only for the next part of S_utf8_mg_pos
to walk linearly again to find what the original request was looking for.
Nicholas Clark
Somehow I can't grasp the magnitude of this... Do you say that during
a year one could use substr() on any piece of data coming from file
input only once? And in presence of certain sequences of 8-bit chars
in input the second substr() will always return junk?
And that the error of this frequency was not detected until this report?
I can't believe my eyes...
Ilya
Strictly no, as this problem is a bug in the UTF8 offset caching code,
and that was released in 5.8.1 this September.
Not sure when RedHat and other distributions started shipping perl "5.8.0"
with some of 5.8.1's features ported into it.
The data also has to be utf8, not 8 bit.
(or at least internally utf8 encoded)
> And that the error of this frequency was not detected until this report?
It's not that frequent. Else it would have been detected.
The problems with corruption of $1 etc. when the first regexp run caused
the regexp engine's internal utf8 code to be loaded occurred far more
frequently, and were reported several times.
I infer that they weren't detected because the bug only happened on the
first regexp, and all the regression tests run in order in 1 perl
interpreter, and tend to have utf8 tests later rather than at the start.
Nicholas Clark
That's the itch - it has to be UTF-8 flagged, not encoded.
Perl doesn't check if it's valid UTF-8. substr(), index(), etc will
happily work on bogus 8 bit data thinking it's UTF-8.
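A small sketch of the difference between "flagged" and "valid". It uses Encode::_utf8_on(), an internal-use function that sets the flag without any validation, purely to build a bogus string for demonstration; the variable name is illustrative:

```perl
use strict;
use warnings;
use Encode ();

my $bogus = "ab\xa8cd";          # a lone 0xA8 byte is not well-formed UTF-8
Encode::_utf8_on($bogus);        # set the UTF8 flag; nothing is checked

print "flagged\n"   if utf8::is_utf8($bogus);      # the flag is on
print "malformed\n" unless utf8::valid($bogus);    # but the bytes are not valid UTF-8
```

String operations consult the flag, not the well-formedness of the bytes, so any check along the lines of utf8::valid() has to be made explicitly by the caller.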
Regards,
Adi
I'm sorry, I meant "on RH9, thus during a year", not just "during a year".
> The data also has to be utf8, not 8 bit.
> (or at least internally utf8 encoded)
IIUC, on RH9 all the data is utf8 by default...
Yours,
Ilya