[Fwd: Re: [perl #24605] FreezeThaw fails with non-ascii]

Ilya Zakharevich

Dec 9, 2003, 6:09:30 PM

to David Greaves, Mailing list Perl5

This message is about FreezeThaw misbehaving when frozen data goes
through pipes on RH9 with 5.8.0. I asked to dump the characters in
the strings on both sides of the pipe:

On Tue, Dec 09, 2003 at 06:04:33PM +0000, David Greaves wrote:
> 73 32 65 105 110 194 180 116
> 73 32 65 105 110 194 180 116

So strings contain identical characters, but do not behave
identically. ==> Perl bug. FreezeThaw uses

substr($foo, $bar) =~ /^abc/;

and this construct may have missed the Perl test suite (if RH9 people
ran one when they released 5.8.0 ;-[). I cc to p5p.

The next step is to run

Dump($output)
Dump($input)

on both strings, after doing `use Devel::Peek'.

Thanks for your cooperation,
Ilya

David Greaves

Dec 10, 2003, 4:29:56 AM

to Ilya Zakharevich, Mailing list Perl5

OK. Like this?

$ cat tst1
#!perl -w
use Devel::Peek;
use FreezeThaw qw(freeze thaw);
my $var = "I Ain\x{c2}\x{b4}t";
print STDERR ord, " " for split //, $var;
print STDERR "\n";
#print freeze($var);
print $var;
print STDERR Dump($var);

$ cat tst2
#!perl -w
use Devel::Peek;
use FreezeThaw qw(freeze thaw);
my $enc = <>;
chomp $enc;
print STDERR ord, " " for split //, $enc;
print STDERR "\n";
print STDERR Dump($enc);
my ($var1)=thaw($enc);
print $var1;

$ echo $LANG
en_GB.UTF-8

$ perl tst1 | perl tst2

73 32 65 105 110 194 180 116

SV = PV(0x804cbf8) at 0x8060234
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK)
PV = 0x805cda8 "I Ain\302\264t"\0
CUR = 8
LEN = 9

73 32 65 105 110 194 180 116

SV = PV(0x804cc7c) at 0x8060234
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
PV = 0x80856e8 "I Ain\303\202\302\264t"\0 [UTF8 "I Ain\x{c2}\x{b4}t"]
CUR = 10
LEN = 80
Signature not present, continuing anyway at
/usr/lib/perl5/site_perl/5.8.0/FreezeThaw.pm line 661, <> line 1.
Do not know how to thaw data with code `I' at
/usr/lib/perl5/site_perl/5.8.0/FreezeThaw.pm line 542
FreezeThaw::thawScalar(0) called at
/usr/lib/perl5/site_perl/5.8.0/FreezeThaw.pm line 679
FreezeThaw::thaw('I Ain\x{c2}\x{b4}t') called at tst2 line 9

in tst1:
-#print freeze($var);
-print $var;
+print freeze($var);
+#print $var;

$ LANG=C

$ perl tst1 | perl tst2

73 32 65 105 110 194 180 116

SV = PV(0x804cbf8) at 0x805f8c8
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK)
PV = 0x805c490 "I Ain\302\264t"\0
CUR = 8
LEN = 9
70 114 84 59 64 49 124 36 56 124 73 32 65 105 110 194 180 116
SV = PV(0x804cc7c) at 0x805f8c8
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK)
PV = 0x8066ac8 "FrT;@1|$8|I Ain\302\264t"\0
CUR = 18
LEN = 80
I Ain´t

David

Enache Adrian

Dec 10, 2003, 6:21:11 AM

to David Greaves, Ilya Zakharevich, Mailing list Perl5

On Wed, Dec 10, 2003 a.d., David Greaves wrote:
> OK. Like this?

To resolve your immediate problem, set LC_ALL=C in your environment.
5.8.0 used to treat all file handles as UTF-8 if the locales were
xx_XX.UTF-8 (and consequently badly scramble binary data).
That has been corrected in perl >= 5.8.1.

But your example seems to be shaking out a real bug in the Perl core -
involving UTF-8 strings and substr().
More on that hopefully soon.

Regards,
Adi
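
The two dumps from the UTF-8 locale run are consistent with that explanation: the same eight characters occupy 8 bytes on the writing side and 10 bytes, with the UTF8 flag set, on the reading side. A minimal sketch of that byte arithmetic, written in Python purely for illustration (the variable names are made up here, and nothing below is Perl internals):

```python
# The eight character codes David dumped on both sides of the pipe.
codepoints = [73, 32, 65, 105, 110, 194, 180, 116]
text = "".join(chr(c) for c in codepoints)

# One byte per character: matches CUR = 8 in the writer's Dump.
as_bytes = text.encode("latin-1")

# UTF-8 encoded: the two characters above 127 take two bytes each,
# matching CUR = 10 and "I Ain\303\202\302\264t" in the reader's Dump.
as_utf8 = text.encode("utf-8")

print(len(as_bytes), as_bytes)
print(len(as_utf8), as_utf8)
```

The character values are identical in both forms, which is why the ord() listings match even though the internal representations differ.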

Ilya Zakharevich

Dec 10, 2003, 3:34:39 PM

to David Greaves, Mailing list Perl5

On Wed, Dec 10, 2003 at 01:21:11PM +0200, Enache Adrian wrote:
> But your example seems to be shaking out a real bug in the Perl core -
> involving UTF-8 strings and substr().
> More on that hopefully soon.

Yes, this is why I'm persisting in investigating this. I can
reproduce it with 5.8.2, so it is not a fault of RH.

perl -MDevel::Peek -MFreezeThaw=freeze,thaw -wle "$a=freeze qq(bc\xa8d); print $a; $a .= chr 567; chop $a; Dump $a; print $a; print thaw $a"

Is somebody going to look at this? The CORE example is

perl -MDevel::Peek -wle "$a=qq(bzc\xa8d); print $a; $a .= chr 567; chop $a; Dump $a; print substr $a, 2, 1; substr($a, 2) =~ /^/; Dump $a; print substr $a, 2, 1"

In short: doing substr() =~ /^/ on an utf8 string is ruining future
substr() (it has wrong length).

Hope this helps,
Ilya

Nicholas Clark

Dec 10, 2003, 4:14:15 PM

to Ilya Zakharevich, Andreas J. Koenig, David Greaves, Mailing list Perl5

On Tue, Dec 09, 2003 at 03:09:30PM -0800, Ilya Zakharevich wrote:
> This message is about FreezeThaw misbehaving when frozen data goes
> through pipes on RH9 with 5.8.0. I asked to dump the characters in
> the strings on both sides of the pipe:

For a RedHat value of "5.8.0" (to be fair, a Jarkko approved RedHat ...)

On Wed, Dec 10, 2003 at 12:34:39PM -0800, Ilya Zakharevich wrote:
> On Wed, Dec 10, 2003 at 01:21:11PM +0200, Enache Adrian wrote:
> > But your example seems to be shaking out a real bug in the Perl core -
> > involving UTF-8 strings and substr().
> > More on that hopefully soon.
>
> Yes, this is why I'm persisting in investigating this. I can
> reproduce it with 5.8.2, so it is not a fault of RH.
>
> perl -MDevel::Peek -MFreezeThaw=freeze,thaw -wle "$a=freeze qq(bc\xa8d); print $a; $a .= chr 567; chop $a; Dump $a; print $a; print thaw $a"
>
> Is somebody going to look at this? The CORE example is

I hope that Enache is, given what he wrote.

> perl -MDevel::Peek -wle "$a=qq(bzc\xa8d); print $a; $a .= chr 567; chop $a; Dump $a; print substr $a, 2, 1; substr($a, 2) =~ /^/; Dump $a; print substr $a, 2, 1"
>
> In short: doing substr() =~ /^/ on an utf8 string is ruining future
> substr() (it has wrong length).
>
> Hope this helps,

Yes, very much. Thanks for reducing it down to a pure core bug.

It looks like it's a bug introduced as part of Jarkko's post 5.8.0
UTF8 caching code. With vanilla 5.8.0:

$ perl5.8.0 -MDevel::Peek -wle '$a=qq(bzc\xa8d); print $a; $a .= chr 567; chop $a;Dump $a; print substr $a, 2, 1; substr($a, 2) =~ /^/; Dump $a; print substr $a,2, 1'
bzc¨d
SV = PV(0x8140ff8) at 0x814a6fc
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x814e038 "bzc\302\250d"\0 [UTF8 "bzc\x{a8}d"]
CUR = 6
LEN = 9
c
SV = PV(0x8140ff8) at 0x814a6fc
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x814e038 "bzc\302\250d"\0 [UTF8 "bzc\x{a8}d"]
CUR = 6
LEN = 9
c

With 5.8.1 (and 5.8.2)

bzc¨d
SV = PV(0x813ed98) at 0x8148b10
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x8153ff0 "bzc\302\250d"\0 [UTF8 "bzc\x{a8}d"]
CUR = 6
LEN = 9
c
SV = PVMG(0x8166d50) at 0x8148b10
REFCNT = 1
FLAGS = (SMG,POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x8153ff0 "bzc\302\250d"\0 [UTF8 "bzc\x{a8}d"]
CUR = 6
LEN = 9
MAGIC = 0x8160060
MG_VIRTUAL = &PL_vtbl_utf8
MG_TYPE = PERL_MAGIC_utf8(w)
MG_LEN = 5
MG_PTR = 0x8160de8
0: 2 -> 2
1: 5 -> 4
c¨d

Andreas - can your binsearch easily find the patch where this changes from
ok to not ok?

perl -wle '$a=qq(bzc\xa8d) . chr 567; $b = substr ($a, 2, 1); substr($a, 2) =~ /^/; print $b eq substr ($a,2, 1) ? "ok" : "not ok"'

All I know is that it's somewhere between 5.8.0 and 5.8.1

Nicholas Clark

Enache Adrian

Dec 10, 2003, 4:25:10 PM

to Ilya Zakharevich, David Greaves, Mailing list Perl5

On Wed, Dec 10, 2003 a.d., Ilya Zakharevich wrote:
> Is somebody going to look at this?

It will be fixed tomorrow.

Regards,
Adi

Andreas J Koenig

Dec 11, 2003, 9:41:22 AM

to Nicholas Clark, Andreas J. Koenig, David Greaves, Mailing list Perl5

>>>>> On Wed, 10 Dec 2003 21:14:15 +0000, Nicholas Clark <ni...@ccl4.org> said:

> [...binsearch...]

----Program----

$a=qq(bzc\xa8d);
print $a;
$a .= chr 567;
chop $a;

# Dump $a;

print substr $a, 2, 1;
substr($a, 2) =~ /^/;

# Dump $a;
print substr $a,2, 1;

----Output of .../pjJua9F/perl-5.8.0@18529/bin/perl----
bzc¨dcc
----EOF ($?='0')----
----Output of .../pF0YsxJ/perl-5.8.0@18530/bin/perl----
bzc¨dcc¨d
----EOF ($?='0')----

--
andreas

Enache Adrian

Dec 11, 2003, 3:36:15 PM

to Ilya Zakharevich, David Greaves, Mailing list Perl5

On Wed, Dec 10, 2003 a.d., Ilya Zakharevich wrote:
> perl -MDevel::Peek -MFreezeThaw=freeze,thaw -wle "$a=freeze qq(bc\xa8d); print $a; $a .= chr 567; chop $a; Dump $a; print $a; print thaw $a"

A simple example - it should print a single 'x':

$ perl -CO -e '$y = "x"x 25 . chr 345; substr $y, 2; print substr $y, 8, 1'
xxxxxxxxx

The second substr() was using the cached utf-8/byte length from
the first (see Perl_sv_pos_u2b()).
The following patch (applied to blead as #21875) should fix it for now.

Regards,
Adi

--- /arc/bleadperl/sv.c Mon Dec 8 06:51:20 2003
+++ ./sv.c Thu Dec 11 18:00:47 2003
@@ -5735,10 +5735,8 @@ S_utf8_mg_pos_init(pTHX_ SV *sv, MAGIC *
bool found = FALSE;

if (SvMAGICAL(sv) && !SvREADONLY(sv)) {
- if (!*mgp) {
- sv_magic(sv, 0, PERL_MAGIC_utf8, 0, 0);
- *mgp = mg_find(sv, PERL_MAGIC_utf8);
- }
+ if (!*mgp)
+ *mgp = sv_magicext(sv, 0, PERL_MAGIC_utf8, &PL_vtbl_utf8, 0, 0);
assert(*mgp);

if ((*mgp)->mg_ptr)
@@ -5831,6 +5829,12 @@ S_utf8_mg_pos(pTHX_ SV *sv, MAGIC **mgp,
/* Update the cache. */
(*cachep)[i] = (STRLEN)uoff;
(*cachep)[i+1] = p - start;
+
+ /* Drop the stale "length" cache */
+ if (i == 0) {
+ (*cachep)[2] = 0;
+ (*cachep)[3] = 0;
+ }

found = TRUE;
}
--- /arc/bleadperl/t/op/substr.t Mon Aug 4 02:43:17 2003
+++ ./t/op/substr.t Thu Dec 11 17:40:35 2003
@@ -1,6 +1,6 @@
#!./perl

-print "1..176\n";
+print "1..177\n";

#P = start of string Q = start of substr R = end of substr S = end of string

@@ -601,4 +601,11 @@ ok 174, $x eq "\x{100}\x{200}\xFFb";
}
my $x = my $y = 'AB'; ss $x; ss $y;
ok 176, $x eq $y;
+}
+
+# [perl #24605]
+{
+ my $x = "0123456789\x{500}";
+ my $y = substr $x, 4;
+ ok 177, substr($x, 7, 1) eq "7";
}
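
The kind of failure described in this message, where a later substr() reuses a length cached for an earlier one, can be modelled with a toy cache. The sketch below is in Python and is deliberately simplified: the class `OffsetCache`, its two slots and the `invalidate` switch are hypothetical names invented for illustration, and the layout is not that of Perl's PERL_MAGIC_utf8 cache. It only shows why a cached byte length has to be dropped when the cached start offset changes.

```python
class OffsetCache:
    """Toy model of a char-offset to byte-offset cache over a UTF-8 buffer."""

    def __init__(self, text, invalidate):
        self.text = text
        self.data = text.encode("utf-8")
        self.start = None    # (char offset, byte offset)
        self.length = None   # (char length, byte length), relative to start
        self.invalidate = invalidate

    def _byte_off(self, char_off):
        # Linear walk: byte offset of the given character offset.
        return len(self.text[:char_off].encode("utf-8"))

    def substr(self, off, length):
        if self.start is None or self.start[0] != off:
            self.start = (off, self._byte_off(off))
            if self.invalidate:
                self.length = None   # drop the now-stale length entry
        b0 = self.start[1]
        if self.length is None or self.length[0] != length:
            self.length = (length, self._byte_off(off + length) - b0)
        blen = self.length[1]
        return self.data[b0:b0 + blen].decode("utf-8", errors="replace")


stale = OffsetCache("bzc\u00a8d", invalidate=False)
fresh = OffsetCache("bzc\u00a8d", invalidate=True)
for cache in (stale, fresh):
    first = cache.substr(2, 2)    # two characters, three bytes
    second = cache.substr(0, 2)   # two characters, should be two bytes
    print(repr(first), repr(second))
```

In the model without invalidation, the second call reuses the three-byte length measured from the old start and returns one character too many; with invalidation it returns the expected two characters.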

Nicholas Clark

Dec 11, 2003, 5:03:11 PM

to Ilya Zakharevich, David Greaves, Mailing list Perl5

On Thu, Dec 11, 2003 at 10:36:15PM +0200, Enache Adrian wrote:
> On Wed, Dec 10, 2003 a.d., Ilya Zakharevich wrote:
> > perl -MDevel::Peek -MFreezeThaw=freeze,thaw -wle "$a=freeze qq(bc\xa8d); print $a; $a .= chr 567; chop $a; Dump $a; print $a; print thaw $a"
>
> A simple example - it should print a single 'x':
>
> $ perl -CO -e '$y = "x"x 25 . chr 345; substr $y, 2; print substr $y, 8, 1'
> xxxxxxxxx
>
> The second substr() was using the cached utf-8/byte length from
> the first (see Perl_sv_pos_u2b()).
> The following patch (applied to blead as #21875) should fix it for now.

Thanks. I'm glad I didn't end up being the one hunting for this.

"for now" - what are you thinking comes later?

> @@ -5831,6 +5829,12 @@ S_utf8_mg_pos(pTHX_ SV *sv, MAGIC **mgp,
> /* Update the cache. */
> (*cachep)[i] = (STRLEN)uoff;
> (*cachep)[i+1] = p - start;
> +
> + /* Drop the stale "length" cache */
> + if (i == 0) {
> + (*cachep)[2] = 0;
> + (*cachep)[3] = 0;
> + }

Shouldn't there be a whole class of bugs fixed by this?
Basically anything calling S_utf8_mg_pos with i = 2 after a previous substr.

Looking at S_utf8_mg_pos I can't see that the line

STRLEN ulen = sv_len_utf8(sv);

is optimally efficient, given that it will do a linear walk through of the
byte string to find the end, only for the next part of S_utf8_mg_pos
to walk linearly to find what the original request was looking for.

Nicholas Clark

Ilya Zakharevich

Dec 12, 2003, 5:56:45 PM

to David Greaves, Mailing list Perl5

On Thu, Dec 11, 2003 at 10:36:15PM +0200, Enache Adrian wrote:

> A simple example - it should print a single 'x':
>
> $ perl -CO -e '$y = "x"x 25 . chr 345; substr $y, 2; print substr $y, 8, 1'
> xxxxxxxxx

Somehow I can't grasp the magnitude of this... Do you say that during
a year one could use substr() on any piece of data coming from file
input only once? And in presence of certain sequences of 8-bit chars
in input the second substr() will always return junk?

And that the error of this frequency was not detected until this report?

I can't believe my eyes...

Ilya

Nicholas Clark

Dec 13, 2003, 8:55:51 AM

to Ilya Zakharevich, David Greaves, Mailing list Perl5

On Fri, Dec 12, 2003 at 02:56:45PM -0800, Ilya Zakharevich wrote:
> On Thu, Dec 11, 2003 at 10:36:15PM +0200, Enache Adrian wrote:
> > A simple example - it should print a single 'x':
> >
> > $ perl -CO -e '$y = "x"x 25 . chr 345; substr $y, 2; print substr $y, 8, 1'
> > xxxxxxxxx
>
> Somehow I can't grasp the magnitude of this... Do you say that during
> a year one could use substr() on any piece of data coming from file
> input only once? And in presence of certain sequences of 8-bit chars
> in input the second substr() will always return junk?

Strictly no, as this problem is a bug in the UTF8 offset caching code,
and that was released in 5.8.1 this September.
Not sure when RedHat and other distributions started shipping perl "5.8.0"
with some of 5.8.1's features ported into it.

The data also has to be utf8, not 8 bit.
(or at least internally utf8 encoded)

> And that the error of this frequency was not detected until this report?

It's not that frequent. Else it would have been detected.

The problems with corruption of $1 etc. when the first regexp run triggered
the regexp engine's internal utf8 loading code occurred far more frequently,
and were reported several times.

I infer that *they* weren't detected because the bug only happened on the
first regexp, and all the regression tests run in order in 1 perl
interpreter, and tend to have utf8 tests later rather than at the start.

Nicholas Clark

Enache Adrian

Dec 13, 2003, 10:04:24 AM

to Ilya Zakharevich, David Greaves, Mailing list Perl5

On Sat, Dec 13, 2003 a.d., Nicholas Clark wrote:
> The data also has to be utf8, not 8 bit.
> (or at least internally utf8 encoded)

That's the itch - it has to be UTF-8 flagged, not encoded.
Perl doesn't check if it's valid UTF-8. substr(), index(), etc will
happily work on bogus 8 bit data thinking it's UTF-8.

Regards,
Adi
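
The point about flagged versus valid data can be illustrated with a walker that trusts each lead byte and never validates what follows. This is a rough Python sketch under stated assumptions: `skip_len` and `trusting_length` are hypothetical helpers written for this illustration, the lead-byte table is simplified to sequences of at most four bytes, and none of it is Perl's actual code.

```python
def skip_len(lead):
    """Sequence length implied by a lead byte (simplified, no validation)."""
    if lead >= 0xF0:
        return 4
    if lead >= 0xE0:
        return 3
    if lead >= 0xC0:
        return 2
    return 1


def trusting_length(buf):
    """Count 'characters' by hopping from one lead byte to the next."""
    i = count = 0
    while i < len(buf):
        i += skip_len(buf[i])
        count += 1
    return count


# Twelve Latin-1 bytes; the lone 0xE9 is not valid UTF-8.
latin1 = "caf\xe9 au lait".encode("latin-1")
print(len(latin1), trusting_length(latin1))
```

Because 0xE9 looks like the start of a three-byte sequence, the walker swallows the two bytes after it, so every character position computed past that point is off.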

Ilya Zakharevich

Dec 13, 2003, 4:01:49 PM

to David Greaves, Mailing list Perl5

On Sat, Dec 13, 2003 at 01:55:51PM +0000, Nicholas Clark wrote:
> > Somehow I can't grasp the magnitude of this... Do you say that during
> > a year one could use substr() on any piece of data coming from file
> > input only once? And in presence of certain sequences of 8-bit chars
> > in input the second substr() will always return junk?
>
> Strictly no, as this problem is a bug in the UTF8 offset caching code,
> and that was released in 5.8.1 this September.
> Not sure when RedHat and other distributions started shipping perl "5.8.0"
> with some of 5.8.1's features ported into it.

I'm sorry, I meant "on RH9, thus during a year", not just "during a year".

> The data also has to be utf8, not 8 bit.
> (or at least internally utf8 encoded)

IIUC, on RH9 all the data is utf8 by default...

Yours,
Ilya